Check.AI

深度评测 · 2026-05-12 · by

Claude Opus 4.7 Review: SWE-bench 87.6%, Same Price, Who Should Upgrade

Claude Opus 4.7, which Anthropic shipped on 2026-04-16, is the most substantial Claude release of the past year. SWE-bench Pro jumped 10.9 points in a single version, the hallucination rate dropped from 61% to 36%, high-resolution image support arrived, and the price held steady (though a new tokenizer means a hidden 0-35% increase). This review uses Anthropic's official docs, hands-on testing from Vellum, and Artificial Analysis data to put Opus 4.7 next to 4.6, GPT-5.4, Gemini 3.1 Pro, and Sonnet 4.6. By the end you'll know whether to upgrade, how to upgrade, and which cases actually argue against it.

30-second verdict

Compare every model live on Check.AI →

Core specs

Item Claude Opus 4.7
API model IDclaude-opus-4-7
Release date2026-04-16
Context window1,000,000 tokens
Max output128,000 tokens
Input price$5.00 / million tokens
Output price$25.00 / million tokens
Cache hitinput price × 0.1 (90% off)
Batch APIhalf price on input/output
High-resolution images2576px / 3.75MP (previous gen 1568px / 1.15MP)
AvailabilityAnthropic API, AWS Bedrock, Azure, Google Vertex

Key benchmarks vs the last gen and rivals

Benchmark Opus 4.7 Opus 4.6 GPT-5.4 Gemini 3.1 Pro
SWE-bench Verified (agent coding)87.6%80.8%N/A80.6%
SWE-bench Pro (harder)64.3%53.4%57.7%54.2%
Terminal-Bench 2.0 (CLI)69.4%65.4%82.7% (GPT-5.5)68.5%
MCP-Atlas (multi-tool calling)77.3%75.8%68.1%73.9%
Finance Agent v1.164.4%60.1%61.5% (Pro)59.7%
OSWorld-Verified (computer use)78.0%72.7%75.0%N/A
BrowseComp (web research)79.3%83.7%89.3% (Pro)85.9%
GPQA Diamond (scientific reasoning)94.2%91.3%94.4% (Pro)94.3%
CharXiv (visual reasoning)82.1%69.1%N/AN/A
Hallucination rate (lower is better)36%61%N/AN/A

Data from Anthropic's official docs, Vellum evaluations, and the Artificial Analysis Intelligence Index, current as of May 2026. "N/A" means that source did not publish the figure. GPT-5.4 Pro is OpenAI's higher-effort version, at a higher price.

5 changes that actually matter

1. SWE-bench Pro +10.9 points: the agent-coding inflection point

SWE-bench Verified has been pushed past 80% to the point that nobody cares. SWE-bench Pro is the real agent-coding benchmark for 2026: harder, demands multi-step planning, requires cross-file coordination. The 10.9-point jump from Opus 4.6 to 4.7 (53.4% to 64.3%) is the largest single-version gain across all frontier models in the past year, leaving GPT-5.4's 57.7% and Gemini 3.1 Pro's 54.2% well behind.

What it means in practice: where Claude Code used to land a large refactor on the first try about 60% of the time, it's now 75%+. One fewer retry pays for the upgrade.

2. Hallucination rate cut from 61% to 36%

This is the most dramatic number Anthropic published. On the same test suite, Opus 4.6 hallucinated 61% of the time; 4.7 only 36%. The mechanism is that the model is more willing to say "I don't know" rather than make something up. For production that matters most where a wrong answer costs more than no answer: automated support, legal RAG, medical assistance. For those, 4.7 is a mandatory upgrade.

3. High-resolution image support (computer use is finally usable)

The image ceiling rose from 1568px / 1.15MP to 2576px / 3.75MP. Coordinates now map 1:1 to pixels, so no scale-factor conversion. That's a step change for three cases:

4. New tokenizer: your bill could rise 0-35%

The price sheet still says $5/$25, but the same Chinese text, code, or data now uses 1.0 to 1.35x the tokens on 4.7. In other words:

Best practice: before upgrading, run 100-500 of your real requests and measure the bill change yourself. Don't take "the price is unchanged" at face value.

5. xhigh effort + task budgets (new tools for agent workflows)

Anthropic added an xhigh effort level (harder-working than high, spends more tokens but is steadier). There's also a new task_budget beta header that gives an agent a total token budget to allocate itself. The model can see the countdown, so it prioritizes and wraps up on time.

It doesn't mean much for indie developers, but it's a step change for enterprise agent workflows (CI/CD integration, automated PR review).

3 breaking API changes to read before upgrading

  1. Extended thinking is gone. Setting thinking: {"type": "enabled", "budget_tokens": N} returns a 400. Use thinking: {"type": "adaptive"} + effort: "high" instead.
  2. temperature / top_p / top_k are all gone. Setting a non-default value returns a 400. Control behavior through the prompt.
  3. Thinking content isn't returned by default. Products that stream the reasoning process in the UI will see long blank stretches. You have to explicitly turn on display: "summarized".

Adaptive thinking is also off by default: set nothing and it won't think at all. That's the biggest behavioral difference from 4.6. Claude Code, Cursor, and Cline have already updated; if you wrote your own SDK integration, you'll need to change it.

Who should upgrade, who shouldn't, who can skip

🟢 Upgrade

🟡 Worth upgrading, but A/B test first

🔴 Don't bother

Real cost estimate (same workload)

Assume a code-review agent handling 500 PRs a month, each averaging 40K tokens in, 4K tokens out, and 3 tool calls.

Model Monthly cost SWE-bench Pro Recommendation
Opus 4.7~$15064.3%Critical PRs + complex refactors
Opus 4.6~$13053.4%No reason to keep it, upgrade to 4.7
Sonnet 4.6~$90~50%Routine PRs, the value pick
GPT-5.4~$7557.7%CLI / terminal tasks
DeepSeek R1~$15~52%Cost-sensitive batch

Estimates for reference only, before prompt cache and batch discounts. Heavy cache reuse can lower Opus 4.7's real cost by 40-60%.

What to watch over the next 6 months

FAQ

When was Opus 4.7 released? April 16, 2026. API ID claude-opus-4-7.

Did the price change? Not on the surface ($5/$25), but the new tokenizer uses 1-1.35x the tokens for Chinese/code, so your real bill could rise 0-35%.

Do I have to upgrade? For agent coding / computer use / RAG, yes. For low-value batch and CLI-heavy work, no.

How does it compare to GPT-5.4? Opus is stronger on SWE-bench Pro (64.3% vs 57.7%); GPT is stronger on BrowseComp (89.3% vs 79.3%). On GPQA the three are nearly tied.

Does upgrading require code changes? Yes. Extended thinking budget and temperature/top_p/top_k are all gone, and thinking content isn't returned by default.

Does the 1M context cost extra? No. The 1M context is standard pricing, with no long-context premium.

→ Compare Opus 4.7 vs other models live on Check.AI

Sources