Anthropic released Claude Opus 4.7 today, April 16, 2026. The fourth Opus upgrade in six months. The benchmarks are impressive. The improvements are real. But the most interesting story is not the model itself — it is what it reveals about where Anthropic is headed, what they are holding back, and what this means for everyone building on AI right now.

I use Claude every day across multiple projects — from software development and data analysis to content creation and business automation. I have spent the last two days deep in Anthropic's ecosystem. Here is my honest take on Opus 4.7, with no hype and no disclaimers.

What Opus 4.7 Actually Is

Opus 4.7 is a direct upgrade to Opus 4.6, which launched in February 2026. It is not a new product tier — it replaces 4.6 as the default Opus model. Same pricing: $5 per million input tokens, $25 per million output tokens. Same 1 million token context window. Same availability across AWS Bedrock, Google Vertex AI, and Microsoft Foundry.

Anthropic's release cadence is now roughly every two months: Opus 4.1, 4.5 (November 2025), 4.6 (February 2026), and now 4.7 (April 2026). That is fast by any standard.

The Numbers That Matter

Coding — The Headline Improvement

SWE-bench Pro (resolving real GitHub issues): 64.3%, up from 53.4% on Opus 4.6. That is an 11-point jump. For context, GPT-5.4 scores 57.7% and Gemini 3.1 Pro hits 54.2%. Opus 4.7 leads the field on the benchmark that matters most to developers.

CursorBench: 70%, up from 58% on Opus 4.6. A 12-point jump on the benchmark that measures how the model performs in the tool developers actually use daily.

Rakuten-SWE-Bench: 3x more tasks resolved than Opus 4.6, with double-digit gains in code quality and test quality. Hex's 93-task coding benchmark: 13% improvement, including four tasks that neither Opus 4.6 nor Sonnet 4.6 could solve at all.

Vision — The Surprise Upgrade

Maximum image resolution jumped from 1.15 megapixels to 3.75 megapixels — more than a 3x increase. Visual acuity went from 54.5% to 98.5% on internal benchmarks. The model can now read small text in screenshots, precisely extract table data from scanned documents, and map coordinates 1:1 with actual pixels for computer use. If you are building screen automation or document processing, this changes what is possible.

Reasoning — Quietly Strong

GPQA Diamond (graduate-level reasoning): 94.2%. arXiv Reasoning with tools: 91.0%, up from 84.7% on Opus 4.6. GDPVal-AA (finance and legal knowledge work): Elo score of 1753, beating GPT-5.4 (1674) and Gemini 3.1 Pro (1314) by a wide margin. BigLaw Bench accuracy: 90.9%.

Where It Does Not Lead

Opus 4.7 is not the best at everything. GPT-5.4 still leads in agentic search (89.3% vs 79.3%) and some multilingual benchmarks. Gemini 3.1 Pro has a 2 million token context window, double Opus 4.7's 1 million. And Anthropic's own Mythos Preview beats Opus 4.7 across the board.

The Real Story: Reliability Over Raw Intelligence

Here is what the benchmarks do not capture and what early testers consistently report: the biggest improvement is not in what Opus 4.7 can do — it is in how reliably it does it.

Opus 4.6 was smart but unreliable on complex tasks. Users on GitHub were vocal — an AMD senior director wrote in a widely-shared post that Claude had regressed to the point it could not be trusted to perform complex engineering. Whether that was an actual regression or unmet expectations, the perception was real.

Opus 4.7 addresses this head-on. The model catches its own logical faults during planning. It self-verifies outputs before reporting back. It continues executing through tool failures that would have stopped Opus 4.6 dead. Multiple early testers describe it as the first version you can hand off hard work to without babysitting.

Hex quantified it: low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6. Warp reported it passed tasks that prior Claude models had failed and worked through a concurrency bug that Opus 4.6 could not crack.

In any production workflow that depends on AI, reliability is worth more than marginal intelligence gains. A model that is 5% smarter but fails unpredictably costs more than a model that is consistently good.
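The arithmetic behind that claim is worth spelling out. Here is a minimal sketch; the success rates and per-attempt costs are illustrative assumptions, not measured figures for any real model:

```python
# Illustrative expected-cost comparison: reliability vs. raw capability.
# All numbers below are made up for the sake of the arithmetic.

def cost_per_completed_task(cost_per_attempt: float, success_rate: float) -> float:
    """Expected cost to get one successful result, assuming failed
    attempts are detected and retried. With independent attempts the
    expected number of tries is 1 / success_rate (geometric distribution)."""
    return cost_per_attempt / success_rate

smarter_but_flaky = cost_per_completed_task(cost_per_attempt=1.00, success_rate=0.70)
consistently_good = cost_per_completed_task(cost_per_attempt=1.00, success_rate=0.95)

print(f"flaky:    ${smarter_but_flaky:.2f} per completed task")   # ~$1.43
print(f"reliable: ${consistently_good:.2f} per completed task")   # ~$1.05
```

Under these assumptions the less capable but more reliable model is roughly 27% cheaper per finished task, and that is before counting the human time spent detecting failures in the first place.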

The Pricing Trap Nobody Is Talking About

Anthropic announced that pricing remains the same as Opus 4.6. That is technically true and practically misleading.

Opus 4.7 uses a new tokenizer. The same input text can produce anywhere from 1.0 to 1.35x as many tokens as before, depending on content type. That means your identical prompts could consume up to 35% more tokens — even though the price per token has not changed. On top of that, Opus 4.7 thinks more at higher effort levels, generating more output tokens on complex tasks.
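A quick worked example makes the effect concrete. The 1.35x worst-case multiplier and the $5 per million input token price come from the announcement; the token count is an arbitrary placeholder:

```python
# Worked example of the tokenizer effect: same per-token price,
# more tokens per identical prompt. Token count is illustrative.

INPUT_PRICE = 5.00 / 1_000_000   # $ per input token (Opus pricing)

old_tokens = 10_000              # what a prompt tokenized to under 4.6
multiplier = 1.35                # worst-case expansion under the new tokenizer
new_tokens = int(old_tokens * multiplier)

old_cost = old_tokens * INPUT_PRICE
new_cost = new_tokens * INPUT_PRICE

print(f"old: {old_tokens:>6} tokens -> ${old_cost:.4f}")
print(f"new: {new_tokens:>6} tokens -> ${new_cost:.4f}")
print(f"increase: {new_cost / old_cost - 1:.0%}")   # 35%
```

The price-per-token line in the announcement is unchanged; it is the quantity being billed that moved.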

For casual users on Claude Pro ($20/month), this does not matter — you have a usage allocation, not per-token billing. For developers on the API making thousands of calls daily, this is a meaningful cost increase that Anthropic buried in the fine print.

The saving grace: prompt caching (up to 90% savings) and batch processing (50% savings) still apply. Anthropic also introduced task budgets in public beta — hard ceilings on token spend for autonomous agents. That such a feature exists at all tells you something about the model's tendency to think more than expected.
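To see how those mitigations stack, here is a rough daily-cost sketch. The discount rates come from the figures above; the call volume, token counts, and cache-hit fraction are assumptions you would replace with your own telemetry:

```python
# Rough daily-cost sketch combining prompt caching and batch processing.
# Discount rates follow the article; workload numbers are assumptions.

INPUT_PRICE = 5.00 / 1_000_000    # $ per input token
OUTPUT_PRICE = 25.00 / 1_000_000  # $ per output token

calls_per_day = 5_000
input_tokens, output_tokens = 8_000, 1_000
cached_fraction = 0.75            # share of input served from the prompt cache
cache_discount = 0.90             # "up to 90% savings" on cached input tokens

def daily_cost(batch: bool) -> float:
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    in_cost = fresh * INPUT_PRICE + cached * INPUT_PRICE * (1 - cache_discount)
    out_cost = output_tokens * OUTPUT_PRICE
    total = (in_cost + out_cost) * calls_per_day
    return total * 0.5 if batch else total   # batch processing: 50% off

print(f"interactive: ${daily_cost(batch=False):,.2f}/day")
print(f"batched:     ${daily_cost(batch=True):,.2f}/day")
```

Even under the new tokenizer, a workload that caches aggressively and batches where latency allows can land well below naive per-call pricing.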

The Mythos Shadow

In the benchmark chart accompanying the Opus 4.7 announcement, Anthropic included Claude Mythos Preview. Mythos beats Opus 4.7 on every single benchmark. SWE-bench Pro: 77.8% (Mythos) vs 64.3% (Opus 4.7). That is not a small gap.

Anthropic publicly admitted — in their own launch blog — that their generally available model is significantly less capable than a model they already built but will not release to the public.

Mythos Preview is restricted to 11 companies through Project Glasswing (Apple, Google, Microsoft among them). The reason: Mythos can find and exploit critical software vulnerabilities in major operating systems and web browsers at a level that rivals skilled human security researchers. Anthropic decided the cybersecurity risk of broad release was too high.

Opus 4.7 had its cybersecurity capabilities deliberately reduced during training — Anthropic called it differential capability reduction. This is unprecedented transparency from an AI company. Most competitors would simply not mention that a better model exists. Anthropic put Mythos in the comparison chart alongside the model they are actually selling. That is either radical honesty or strategic positioning — probably both.

The implication: the frontier of AI capability is further ahead than what is publicly available. What you can use today is a deliberately constrained version of what exists. That gap will only grow as models get more powerful and safety considerations get more complex.

What This Means for Developers

If you are building on the API: The breaking changes are real. Sampling parameters (temperature, top_p, top_k) are removed. Extended thinking budgets are different. The model follows instructions more literally than 4.6. Test before you deploy. Do not just swap claude-opus-4-6 for claude-opus-4-7 and push to production. The new xhigh effort level is a sweet spot between quality and cost that did not exist before.
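As a concrete way to approach that migration, here is a hypothetical pre-flight helper. The removed parameter names and model ids follow the release notes above, but the helper itself is an illustrative sketch, not part of any official SDK:

```python
# Hypothetical pre-flight check for migrating request payloads from
# claude-opus-4-6 to claude-opus-4-7. The removed sampling parameters
# (temperature, top_p, top_k) and model ids follow the release notes;
# the helper and its warnings are an illustration, not an SDK feature.

REMOVED_PARAMS = {"temperature", "top_p", "top_k"}

def migrate_payload(payload: dict) -> tuple[dict, list[str]]:
    """Return a 4.7-safe copy of a request payload plus a warning
    for every field that had to be dropped or rewritten."""
    warnings = []
    clean = {k: v for k, v in payload.items() if k not in REMOVED_PARAMS}
    for k in sorted(REMOVED_PARAMS & payload.keys()):
        warnings.append(f"dropped removed sampling parameter: {k}")
    if clean.get("model") == "claude-opus-4-6":
        clean["model"] = "claude-opus-4-7"
        warnings.append("model id rewritten; re-test prompts before deploying")
    return clean, warnings

request = {"model": "claude-opus-4-6", "temperature": 0.7, "max_tokens": 1024}
clean, warnings = migrate_payload(request)
print(clean)                      # temperature gone, model id updated
for w in warnings:
    print("warning:", w)
```

The point is the workflow, not the helper: surface every silently-dropped parameter in staging, rather than discovering the behavioral change in production.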

If you use Claude Code: Opus 4.7 is already the default. You get a new /ultrareview command that simulates a senior human code reviewer, and Auto Mode is now available for Max plan users. Less back-and-forth on complex tasks — the model plans better and handles multi-step workflows with less supervision.

If you are an everyday user: You will notice better image understanding and more thorough responses on complex questions. For casual use — writing, brainstorming, simple Q&A — the difference from 4.6 is minimal.

What This Means for the AI Industry

The two-month upgrade cadence is relentless. Anthropic has delivered meaningful improvements to Opus every two months since November 2025. Fast enough to make it difficult for competitors to establish durable leads.

The capability-safety tension is becoming explicit. With Mythos, Anthropic has a model that is clearly better but too dangerous to release broadly. This is the first time a major AI company has so publicly grappled with this tradeoff. It will not be the last.

Same price, higher cost will become a pattern. The tokenizer change that increases token consumption while keeping per-token pricing flat is a template other companies will copy. Watch for this in every model update going forward.

The Bottom Line

Opus 4.7 is a genuine improvement over 4.6. The reliability gains alone justify the upgrade for anyone doing serious development work. The vision improvements open new use cases. The self-verification capability means less human supervision on complex tasks.

But the most important thing about this release is not the model — it is the signal. Anthropic is telling us, publicly, that they have something significantly better that they will not release because the safety implications are not resolved. They are deliberately constraining capability in specific domains while advancing it in others.

Whether you think that is responsible caution or competitive strategy, it is the template for how AI releases will work going forward. The era of releasing the most capable model to everyone is over. What you get access to is increasingly a subset of what exists.

For developers: upgrade, test your prompts, monitor your token usage, and use task budgets. The model is better. For everyone else: the most interesting part of this release is everything Anthropic chose not to release.