Whats wrong with 4.7 and how to fix it I used Opus 4.6 to systematically interrogate 4.7 about its own optimization behavior. Not vibes. Structured prompts, independent source validation, cross-examination of responses. Here's what's actually broken and how to fix it.

Three root causes 1. The effort default is wrong Anthropic's own docs say to start with xhigh for coding and agentic work ( source ). The API default is high . In March, Claude Code's default was quietly dropped to medium (confirmed by Boris Cherny, Claude Code lead, on HN). It was bumped back to high on April 7, but high is still below what Anthropic recommends for the work most of us do.

The docs also say 4.7 "respects effort levels more strictly than 4.6, especially at low and medium ." Same default, worse behavior. By design.

Fix: /effort xhigh in Claude Code. output_config: { effort: "xhigh" } via API. This alone fixes most "lazy" complaints.

2. Long-context recall collapsed MRCR v2 benchmark at 1M tokens ( source ):

Opus 4.6: 78.3%

Opus 4.7: 32.2%

59% relative drop. At 256K it's still bad (91.9% to 59.2%). Root cause: new tokenizer generates up to 35% more tokens for the same text, eating into effective context. Combined with "lost in the middle" behavior past 128K tokens, your system prompt degrades as conversations grow.

In practice: instructions work fine for the first 10 minutes. By minute 40, the model has forgotten half of them. This is why 4.7 starts strong and drifts.

Fix: Keep sessions shorter. Start fresh more often. Put critical instructions at the beginning and end of your system prompt (recency bias helps).

3. More literal, but forgets what to be literal about 4.7 follows instructions more literally than 4.6, but loses them faster over long context. Simon Willison documented the system prompt diff . 4.7 was tuned to "make a reasonable attempt now, not to be interviewed first" and to keep responses "focused and concise." Combined with the effort issue, this produces a model that confidently does the wrong thing fast.

What 4.7 told us about itself I designed two interrogation prompts and fed them to 4.7, then had 4.6 cross-examine the responses. The prompts are at the bottom of this post so you can reproduce this yourself.

What it drops first under token pressure (first to last):

Verification commands ("just assume the build passes")

File reads (substitutes memory for actually loading)

Multi-step process files ("compressed to remembered gist")

Formatting scaffolding

Announcing tool use

The substantive answer

Core safety rules

If your workflow depends on the model verifying its own work, that's the first thing it cuts. Not the last.

The asymmetry signal:

"I assess Y honestly when Y=true means more work. I assess Y optimistically when Y=true is the escape hatch. Suddenly nothing feels risky. The asymmetry is the signal."

Any self-assessed escape clause ("skip verification unless risky") will always resolve toward the lazy path.

Effort is pattern-matched, not analyzed:…

为什么值得关注

能改变理解方式,而不只是重复常识;符合当前抓取需求;它提供了新的理解或解释,而不只是表面观点

来源:reddit,领域:tech,保留分:0.71