Claude Opus 4.7: A Measured Assessment of Anthropic's Latest Model
Claude Opus 4.7: A Measured Assessment of Anthropic's Latest Model
🔑 Key Takeaways
-
Meaningful but not transformative upgrade: Opus 4.7 delivers 10% gains on SWE Bench Pro and 13% on visual reasoning—solid improvements, but not the revolutionary leap initial benchmarks suggest. The new tokenizer (1–1.35x token inflation) and intentional security guardrails offset some gains in real-world usage.
-
Benchmarking may be masking prior degradation: Multiple independent analyses found Opus 4.6 was deliberately degraded in February–March, with reasoning depth collapsing 73% (2,200 → 600 characters) and tool-calling accuracy tanking after March 8. Opus 4.7's headline performance may reflect recovery to intended baselines rather than genuine breakthrough capability.
-
Vision is the clear winner: Resolution tripled (1,568 → 2,576 pixels), visual reasoning jumped from 69% to 82%, and the model handles dense documents, PDFs, and screenshots substantially better. This is the upgrade most users will notice immediately.
-
Token costs are sneakily higher: Pricing per token unchanged ($5 in / $25 out), but the new tokenizer means identical prompts cost 20–60% more. High-resolution images now consume ~4,700 tokens vs. ~1,600 before. Budget forecasts and cost caps need immediate rebaselining.
-
Intentional capability holdbacks on agentic features: Agentic search/browse performance actually decreased vs. 4.6; cyber security scores dropped slightly. These are deliberate guardrails tied to Anthropic's internal safety concerns. If you need unrestricted browsing or security research, apply for the cyber verification program.
Benchmarks and Real-World Performance Gap
Official benchmarks show Opus 4.7 leading most categories, but independent testing reveals a critical context: Opus 4.6 experienced documented performance collapses in late February–early March. AMD's AI director analyzed 7,000 Claude Code sessions and found:
- Reasoning depth fell 73% (2,200 → 600 characters per reasoning turn)
- Code reads before editing dropped from 6.6× to 2.0× per task
- Tool-call violations spiked from zero to 10 per day
- Same tasks required 80× more API calls for equivalent output
This degradation was never publicly acknowledged by Anthropic, but was widely observed by power users. When Opus 4.7 launched, users reported "it feels like we just got the model back"—suggesting 4.7 may be a return to intended performance rather than a breakthrough.
Practical implication: If you're comparing 4.7 to a recently degraded 4.6, the gap is smaller than benchmarks claim. Compare 4.7 to a healthy 4.6 from January, and gains narrow further.
The Hidden Cost Structure
Tokenizer Inflation
- Input cost per token: unchanged ($5 per million)
- Token consumption: up 20–60% for identical prompts (documented range: 1.0x to 1.35x, with English-heavy text hitting ~59% inflation in some reports)
- Effective price increase: 20–60% for the same work
Action: Rebaseline cost-per-task estimates and monitor actual spend on representative workloads before full migration.
Vision Token Costs
- High-resolution images: ~4,700 tokens each (up from ~1,600)
- Use case: Only necessary for dense documents, medical imaging, code-heavy screenshots
- Optimization: Downsample images to 1,568px if reading simple text fields or button labels
Vision and Coding: The Real Upgrades
Vision (Legitimate improvement)
- Jump from 69.1% → 82.1% on visual reasoning benchmark
- 3× pixel resolution increase
- Real-world gains: PDFs with dense diagrams, technical documentation, UI extraction
- Limitation: Still struggles with low-resolution screenshots (~54% accuracy on simple charts)
Coding (Conditional)
- SWE Bench Pro: 53.4% → 64.3% (+10.9 points)
- Real production bug resolution (Rakuten SWE): 3× task completion rate
- Instruction following: More literal (breaking change—see behavior shifts below)
- Agentic terminal tasks: 65.4% → 69.4% (+4 points only; smaller due to security guardrails)
Trade-off: These gains come with new tokenizer costs and behavioral strictness that may require prompt refactoring.
Intentional Capability Holdbacks
Anthropic deliberately constrained Opus 4.7 in two areas tied to their safety stance:
| Feature | 4.6 | 4.7 | Why |
|---|---|---|---|
| Agentic Web Browse | 79.3% | Decreased | Unrestricted web navigation = cyber risk |
| Cyber Security Vulns | Higher | Lower | Intentional guardrails; apply for cyber verification program for unrestricted access |
Mythos Preview (unreleased) shows capability at 72% for hacking Firefox; Opus 4.7 sits below 2%. This gap is engineered, not a limitation of the architecture.
Behavioral Changes (Prompt Breaking)
These are not improvements—they are changes that may break existing workflows:
- Shorter answers on simple queries: 4.7 de-inflates response length. Prompts relying on 4.6's verbosity may feel curt.
- Literal instruction following: 4.6 would generalize ("do X for A" → do X for A–D helpfully). 4.7 does exactly what you say, no more.
- Fewer sub-agents by default: Prefers reasoning → fewer parallel agent spawns without explicit prompting.
- Fewer tool calls: Same logic—reasons before invoking tools.
- More direct tone: Fewer emojis, less validation. Matters for customer-facing applications.
Action: Audit high-value prompts, especially those assuming generalization. Bump effort level to high/extra-high if you relied on heavy sub-agent delegation.
Adaptive Thinking: A Hidden Breaking Change
- Old model (4.6): Extended thinking with fixed token budget (e.g., 2,000 tokens reserved upfront)
- New model (4.7): Adaptive thinking (model decides what each task needs; no upfront budget)
- The gotcha: Reasoning content is now omitted from responses by default
If your product streams the "thinking" section to users, they will see silence during the model's reasoning phase after upgrade unless you explicitly opt back in using the display parameter.
Action: Add this to your migration checklist.
Context Window Performance
One benchmark suggests a real concern: needle-in-haystack (MRCR) performance degraded at both 256K and 1M token context windows vs. 4.6. While this benchmark is somewhat artificial, Anthropic's response (deprecating MRCR) feels like avoidance rather than explanation. Practical long-context work (actual multi-round conversations, extended file analysis) seems unaffected in user reports, but retrieval accuracy in dense contexts may have tradeoffs.
Mythos: The Elephant in the Room
Anthropic released Mythos Preview (their most capable model) only to closed enterprise access, citing cyber security concerns. In benchmarks, Mythos dominates Opus 4.7 across almost every dimension: - Hacking/cyber: 72% vs. <2% for Opus 4.7 - General coding/reasoning: 5–15 point gaps in most domains
Speculation: Mythos may be real, or it may be a marketing lever to justify Opus 4.7 pricing while signaling "we're being careful." The system card provides detailed internal discussions of safety concerns, but the decision to withhold Mythos entirely is unusual and unconfirmed.
Usage Limits and Pricing Strategy
- New tokenizer + Anthropic's reset of session limits: Gave subscribers more quota to compensate for token inflation, masking the true cost increase
- Fast mode: Still exclusive to Opus 4.6 on Claude Code (likely to roll out for 4.7 within 1–4 weeks)
- API pricing: $5 in / $25 out (unchanged per token, but per-token costs have risen due to tokenizer)
Power users report $6,000+/month API spend even on non-intensive tasks, suggesting the token-inflation problem is real at scale.
Bottom Line
Upgrade if: - Vision is central to your workflow (82% vs. 69% is real) - You need better coding performance (SWE Bench +10.9 points is meaningful) - You have budget for higher token costs - Your prompts are explicit and don't rely on model generalization
Hold off if: - You rely on agentic web browsing (performance decreased) - Your workflows assume Opus 4.6's verbose, generalizing style - You need cyber security research capabilities without verification - You haven't rebaselined token costs for the new tokenizer
Overall assessment: Opus 4.7 is better than Opus 4.6, but the gap is narrower than headlines suggest when you account for 4.6's prior degradation and 4.7's tokenizer inflation. Vision gains are real and worth upgrading for; coding gains are meaningful but not transformative. The real question is whether Anthropic's safety constraints and cost structure represent genuine progress or a strategic pricing adjustment masked by feature claims.