深度评测 · 2026-05-12 · by @zayuerweb-dev
Claude Opus 4.7 Review: SWE-bench 87.6%, Same Price, Who Should Upgrade
Claude Opus 4.7, which Anthropic shipped on 2026-04-16, is the most substantial Claude release of the past year. SWE-bench Pro jumped 10.9 points in a single version, the hallucination rate dropped from 61% to 36%, high-resolution image support arrived, and the price held steady (though a new tokenizer means a hidden 0-35% increase). This review uses Anthropic's official docs, hands-on testing from Vellum, and Artificial Analysis data to put Opus 4.7 next to 4.6, GPT-5.4, Gemini 3.1 Pro, and Sonnet 4.6. By the end you'll know whether to upgrade, how to upgrade, and which cases actually argue against it.
30-second verdict
- Agent coding / long agentic loops: upgrade. SWE-bench Pro 64.3% is first in the industry.
- Computer use / screenshot understanding: upgrade. OSWorld 78%, 2576px high-resolution image support.
- Knowledge work (docs, slides, charts): upgrade. CharXiv vision 82.1% (+13 points).
- Web research / long tool chains: not necessary. GPT-5.4 Pro still leads on BrowseComp at 89.3%.
- Terminal coding (CLI-heavy): not necessary. GPT-5.5 hits 82.7% on Terminal-Bench, well above Opus 4.7's 69.4%.
- Cost-sensitive batch jobs: do not. At $5/$25 it's 9x DeepSeek R1 and 1.67x Sonnet 4.6.
- When in doubt: Opus 4.7 for the hard tasks, Sonnet 4.6 for routine calls, DeepSeek R1 for batch.
Core specs
| Item | Claude Opus 4.7 |
|---|---|
| API model ID | claude-opus-4-7 |
| Release date | 2026-04-16 |
| Context window | 1,000,000 tokens |
| Max output | 128,000 tokens |
| Input price | $5.00 / million tokens |
| Output price | $25.00 / million tokens |
| Cache hit | input price × 0.1 (90% off) |
| Batch API | half price on input/output |
| High-resolution images | 2576px / 3.75MP (previous gen 1568px / 1.15MP) |
| Availability | Anthropic API, AWS Bedrock, Azure, Google Vertex |
Key benchmarks vs the last gen and rivals
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified (agent coding) | 87.6% | 80.8% | N/A | 80.6% |
| SWE-bench Pro (harder) | 64.3% | 53.4% | 57.7% | 54.2% |
| Terminal-Bench 2.0 (CLI) | 69.4% | 65.4% | 82.7% (GPT-5.5) | 68.5% |
| MCP-Atlas (multi-tool calling) | 77.3% | 75.8% | 68.1% | 73.9% |
| Finance Agent v1.1 | 64.4% | 60.1% | 61.5% (Pro) | 59.7% |
| OSWorld-Verified (computer use) | 78.0% | 72.7% | 75.0% | N/A |
| BrowseComp (web research) | 79.3% | 83.7% | 89.3% (Pro) | 85.9% |
| GPQA Diamond (scientific reasoning) | 94.2% | 91.3% | 94.4% (Pro) | 94.3% |
| CharXiv (visual reasoning) | 82.1% | 69.1% | N/A | N/A |
| Hallucination rate (lower is better) | 36% | 61% | N/A | N/A |
Data from Anthropic's official docs, Vellum evaluations, and the Artificial Analysis Intelligence Index, current as of May 2026. "N/A" means that source did not publish the figure. GPT-5.4 Pro is OpenAI's higher-effort version, at a higher price.
5 changes that actually matter
1. SWE-bench Pro +10.9 points: the agent-coding inflection point
SWE-bench Verified has been pushed past 80% to the point that nobody cares. SWE-bench Pro is the real agent-coding benchmark for 2026: harder, demands multi-step planning, requires cross-file coordination. The 10.9-point jump from Opus 4.6 to 4.7 (53.4% to 64.3%) is the largest single-version gain across all frontier models in the past year, leaving GPT-5.4's 57.7% and Gemini 3.1 Pro's 54.2% well behind.
What it means in practice: where Claude Code used to land a large refactor on the first try about 60% of the time, it's now 75%+. One fewer retry pays for the upgrade.
2. Hallucination rate cut from 61% to 36%
This is the most dramatic number Anthropic published. On the same test suite, Opus 4.6 hallucinated 61% of the time; 4.7 only 36%. The mechanism is that the model is more willing to say "I don't know" rather than make something up. For production that matters most where a wrong answer costs more than no answer: automated support, legal RAG, medical assistance. For those, 4.7 is a mandatory upgrade.
3. High-resolution image support (computer use is finally usable)
The image ceiling rose from 1568px / 1.15MP to 2576px / 3.75MP. Coordinates now map 1:1 to pixels, so no scale-factor conversion. That's a step change for three cases:
- Computer use: full-screen captures aren't blurry, and button targeting is far more accurate.
- Document / form understanding: scanned PDFs and contract screenshots are much more readable.
- Artifact / chart analysis: CharXiv vision rose from 69.1% to 82.1% (+13).
4. New tokenizer: your bill could rise 0-35%
The price sheet still says $5/$25, but the same Chinese text, code, or data now uses 1.0 to 1.35x the tokens on 4.7. In other words:
- Plain short English: essentially no difference.
- Chinese, code, data: possibly 20-35% more.
- The real effect may be offset by Opus 4.7's smaller output (35% fewer output tokens on the same Artificial Analysis benchmark suite).
Best practice: before upgrading, run 100-500 of your real requests and measure the bill change yourself. Don't take "the price is unchanged" at face value.
5. xhigh effort + task budgets (new tools for agent workflows)
Anthropic added an xhigh effort level (harder-working than high, spends more tokens but is steadier). There's also a new task_budget beta header that gives an agent a total token budget to allocate itself. The model can see the countdown, so it prioritizes and wraps up on time.
It doesn't mean much for indie developers, but it's a step change for enterprise agent workflows (CI/CD integration, automated PR review).
3 breaking API changes to read before upgrading
- Extended thinking is gone. Setting
thinking: {"type": "enabled", "budget_tokens": N}returns a 400. Usethinking: {"type": "adaptive"}+effort: "high"instead. - temperature / top_p / top_k are all gone. Setting a non-default value returns a 400. Control behavior through the prompt.
- Thinking content isn't returned by default. Products that stream the reasoning process in the UI will see long blank stretches. You have to explicitly turn on
display: "summarized".
Adaptive thinking is also off by default: set nothing and it won't think at all. That's the biggest behavioral difference from 4.6. Claude Code, Cursor, and Cline have already updated; if you wrote your own SDK integration, you'll need to change it.
Who should upgrade, who shouldn't, who can skip
🟢 Upgrade
- Using Opus 4.6 for Claude Code, agent coding, long agentic loops.
- Running computer use, screenshot understanding, document extraction.
- Doing RAG / support where you'd rather not answer than answer wrong.
- Using a multi-tool agent (MCP-Atlas 77.3%, first in the industry).
🟡 Worth upgrading, but A/B test first
- On Sonnet 4.6 and wanting a quality bump: Opus is 1.67x the price, so check whether your task complexity justifies it.
- Web research / multi-search apps: GPT-5.4 Pro still leads BrowseComp at 89.3%.
- Chinese-heavy traffic: the tokenizer change adds +20-35% for Chinese, so run the numbers.
🔴 Don't bother
- Using DeepSeek R1 / Qwen3 / GLM-4.6 for cost-sensitive batch: Opus is 5-10x their price.
- Pure terminal CLI heavy use: GPT-5.5 leads Terminal-Bench by a wide margin at 82.7%.
- Already on GPT-5.4 Pro for web research / deep search: same generation, no reason to switch.
Real cost estimate (same workload)
Assume a code-review agent handling 500 PRs a month, each averaging 40K tokens in, 4K tokens out, and 3 tool calls.
| Model | Monthly cost | SWE-bench Pro | Recommendation |
|---|---|---|---|
| Opus 4.7 | ~$150 | 64.3% | Critical PRs + complex refactors |
| Opus 4.6 | ~$130 | 53.4% | No reason to keep it, upgrade to 4.7 |
| Sonnet 4.6 | ~$90 | ~50% | Routine PRs, the value pick |
| GPT-5.4 | ~$75 | 57.7% | CLI / terminal tasks |
| DeepSeek R1 | ~$15 | ~52% | Cost-sensitive batch |
Estimates for reference only, before prompt cache and batch discounts. Heavy cache reuse can lower Opus 4.7's real cost by 40-60%.
What to watch over the next 6 months
- When Sonnet 4.7 arrives. The historical pattern: a Sonnet version follows an Opus release by 2-4 months. Expected Q3 2026.
- Whether Gemini 3.5 / GPT-6 overtake it. All three have clustered above 80% on SWE-bench Verified; the next jump comes down to who breaks 90% first.
- The price war. DeepSeek R2 is expected in Q3 and could widen the 1:9 value gap again.
- Whether task budget / xhigh become an industry standard. If OpenAI and Google follow, agent workflows will standardize around them.
- Whether the tokenizer "hidden hike" becomes the new normal. Sticker price unchanged but more tokens used: other vendors may copy it.
Related reading
- The Complete Guide to Running Open-Source LLMs Locally 2026
- RAG vs Long Context vs Fine-tune 2026: What to Pick When
- GPT-5 vs Claude Sonnet 4.6: Which to Pick for Coding
- DeepSeek R1 vs GPT-5: How Many Times Cheaper, Really
- The 2026 Chinese AI Model Landscape
- Best AI Models for Coding
- Long-Context AI Models Compared
FAQ
When was Opus 4.7 released? April 16, 2026. API ID claude-opus-4-7.
Did the price change? Not on the surface ($5/$25), but the new tokenizer uses 1-1.35x the tokens for Chinese/code, so your real bill could rise 0-35%.
Do I have to upgrade? For agent coding / computer use / RAG, yes. For low-value batch and CLI-heavy work, no.
How does it compare to GPT-5.4? Opus is stronger on SWE-bench Pro (64.3% vs 57.7%); GPT is stronger on BrowseComp (89.3% vs 79.3%). On GPQA the three are nearly tied.
Does upgrading require code changes? Yes. Extended thinking budget and temperature/top_p/top_k are all gone, and thinking content isn't returned by default.
Does the 1M context cost extra? No. The 1M context is standard pricing, with no long-context premium.
→ Compare Opus 4.7 vs other models live on Check.AI
Sources
- Claude Opus 4.7 official release and specs · 2026-05-22
- SWE-bench / benchmark comparison · 2026-05-22
- Vellum hands-on testing · 2026-05-22