Testing the Changes in Claude Opus 4.7
On April 16, Anthropic released Claude Opus 4.7. After testing it on several projects, my strongest impression is not that it’s “smarter,” but that it now verifies details on its own.
Previously, version 4.6 would take a task and execute it like an experienced but slightly overconfident veteran. 4.7 is different: after completing a step, it pauses to ask, “Is this correct? Let me check.” This shift toward self-verification affects the daily development experience more than any benchmark number does.
1. Major Change: Self-Verification
4.6’s work mode was Plan → Execute → Report: take a task, plan it, carry it out, then report back.
4.7 changes this to Plan → Execute → Verify → Report, inserting an explicit verification step.
Specific behaviors include:
- After writing code, it will actively run tests to confirm its work is correct, rather than just saying “done” after completion.
- After fixing a bug, it will construct boundary conditions to test again, instead of just submitting the fix.
- After generating documents or slides, it will check the layout and content for correctness.
- When faced with uncertain data, it will honestly say “I am uncertain” rather than fabricating a plausible answer.
Feedback from the Vercel team was to the point: 4.7 “verifies the existing system code before starting work.” The CTO of Hex noted that it “correctly reports when data is missing instead of returning a plausible-looking but fabricated fallback value.”
The Notion AI team’s tests were even more direct—4.7 was the first model to pass their “implicit requirement test.” Implicit requirements are tasks that users expect to be completed without explicitly stating them. While 4.6 would overlook these, 4.7 anticipates them.
Practical experience: 4.6 would often write incorrect code with complete confidence, so every step needed manual review. 4.7 reviews itself, which saves me a lot of time I used to spend correcting its mistakes.
2. Instruction Adherence: More Literal
4.6 had a relatively “loose” understanding of instructions. When given a prompt, it would interpret the intent, sometimes adding elements you didn’t request or skipping steps it deemed unimportant.
4.7 is different. It does exactly what you say. No more, no less.
The official wording states:
The model will not silently generalize an instruction from one item to another, and will not infer requests you didn’t make.
This is a double-edged sword:
- Benefits: It will no longer “take liberties”; every change is strictly within your authorization.
- Costs: Prompts that were previously written casually may now yield unexpected results. Points that were treated as “suggestions” will now be executed as a “mandatory checklist.”
Migration advice: if your prompts include phrases like “consider X” or “suggest X,” 4.7 may treat them as hard requirements. Either mark them explicitly as optional or remove them; the sketch below shows one way to do that.
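A minimal sketch of that rewording against the Anthropic Messages API in Python. The model id `claude-opus-4-7`, the file name, and the task wording are placeholders I made up for illustration; only the client call itself follows the published SDK.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# 4.6-era phrasing: "consider adding an in-memory cache" could be read as a suggestion.
# For 4.7, mark the same point explicitly as optional so it is not executed as a hard requirement.
system_prompt = (
    "Refactor data_loader.py to remove the duplicated parsing logic. "
    "Optional, only if it does not complicate the change: add an in-memory cache."
)

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model id, used only for illustration
    max_tokens=2048,
    system=system_prompt,
    messages=[{"role": "user", "content": "Here is the current data_loader.py:\n..."}],
)
print(response.content[0].text)
```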
3. Tone Change: More Direct
4.6 had a warmer style, often affirming your ideas before cautiously suggesting alternatives, and used emojis frequently.
4.7 is much more straightforward. The official description is a “more direct, opinionated tone with less validation-forward phrasing and fewer emojis.”
Actual experience:
- 4.6: “Great question! This is a really interesting approach. One thing we might consider is…”
- 4.7: “This won’t work because X. Here’s what to do instead: …”
Some find 4.7 “less friendly,” but as a developer, I prefer this style—it saves time. If I want emotional support, I won’t turn to AI.
4. Coding Ability: More Than Just Benchmark Improvements
First, the numbers:
| Benchmark | Opus 4.6 | Opus 4.7 | Change |
|---|---|---|---|
| SWE-bench Verified | 80.8% | 87.6% | +6.8pp |
| SWE-bench Pro | 53.4% | 64.3% | +10.9pp |
| CursorBench | 58% | 70% | +12pp |
| XBOW Visual Accuracy | 54.5% | 98.5% | +44pp |
| MCP-Atlas | 62.7% | 77.3% | +14.6pp |
Beyond the numbers, the differences in actual coding are more significant:
- Rakuten reported that 4.7 handled three times as many production tasks as 4.6.
- Across 93 coding benchmarks, 4.7 solved four tasks that neither 4.6 nor Sonnet 4.6 could solve.
- Code review recall improved by over 10% while precision held steady: it finds more bugs without raising the false-positive rate.
5. Visual Ability: From “Good Enough” to “Clear”
The maximum image input increased from 1,568 px / 1.15 MP to 2,576 px / 3.75 MP, more than tripling the pixel count.
More importantly, coordinate mapping has become 1:1—pixels seen by the model are now actual coordinates, eliminating the need for scaling calculations. This is a qualitative leap for computer use (allowing AI to operate screens).
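As a rough illustration of what this removes from a computer-use loop, here is the kind of rescaling step that was needed when screenshots were downscaled to the old 1,568 px limit; with a 1:1 mapping the scale factors become 1 and the model’s coordinates can be used directly. The function and variable names are mine, not from any SDK.

```python
def model_to_screen(x_model: int, y_model: int,
                    screen_w: int, screen_h: int,
                    sent_w: int, sent_h: int) -> tuple[int, int]:
    """Convert a click coordinate the model returned on the (possibly downscaled)
    screenshot it was shown back into real screen pixels.

    Pre-4.7 workflow: the screenshot was resized to fit the 1,568 px limit, so
    sent_w/sent_h differed from screen_w/screen_h and rescaling was required.
    With a 1:1 mapping, sent_w == screen_w and sent_h == screen_h, and this
    function reduces to the identity.
    """
    scale_x = screen_w / sent_w
    scale_y = screen_h / sent_h
    return round(x_model * scale_x), round(y_model * scale_y)


# Example: a 2560x1440 screen whose screenshot was downscaled to 1568x882.
print(model_to_screen(784, 441, 2560, 1440, 1568, 882))  # -> (1280, 720)
```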
4.6 often missed small text or misidentified button positions in screenshots. 4.7’s score on the visual accuracy test jumped from 54.5% to 98.5%, which in practice means “what you see is what you get.”
6. New Effort Level: xhigh
4.6 had four effort levels: low / medium / high / max. 4.7 introduces a new level xhigh between high and max.
Claude Code now defaults to using xhigh. Why not max? Because while max allows for deeper reasoning, it has high latency; xhigh provides a more practical balance.
The Hex team’s conclusion is practical: “4.7’s low effort is roughly equivalent to 4.6’s medium effort”—overall intelligence has increased a notch. If you were using high before, consider switching to xhigh in 4.7.
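For API users, switching effort levels would be roughly a one-line change along these lines. This assumes an `effort` request field like the one Anthropic introduced for recent Opus models; the field name, its placement, and the model id are assumptions here and should be checked against the current API reference, which is why the value is passed through `extra_body` rather than a typed SDK argument.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",            # hypothetical model id for illustration
    max_tokens=4096,
    extra_body={"effort": "xhigh"},     # was "high" on 4.6; field name assumed, verify in the docs
    messages=[{"role": "user", "content": "Review this diff for correctness:\n..."}],
)
print(response.content[0].text)
```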
7. Breaking Changes: Three API-Level Changes
If you use Claude at the API level, be aware of the following:
1. Removal of Extended Thinking Budgets
Extended thinking budgets, previously set via `thinking: {"type": "enabled", "budget_tokens": N}`, are removed.
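For reference, this is the budgeted extended-thinking request shape from the pre-4.7 Messages API, i.e. the form this change removes. The model id is a placeholder, and how reasoning depth is controlled after the removal (presumably the effort levels from section 6) should be confirmed against the migration notes.

```python
import anthropic

client = anthropic.Anthropic()

# Pre-4.7 request shape: extended thinking capped by an explicit token budget.
# Per the note above, the budget_tokens control is removed in 4.7.
response = client.messages.create(
    model="claude-opus-4-6",   # placeholder id for the older model
    max_tokens=8192,
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=[{"role": "user", "content": "Walk through this proof step by step: ..."}],
)

# With thinking enabled, the response interleaves thinking blocks and text blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```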