Claude Opus 4.7: Complete Guide

Model Review · Anthropic · April 2026

Claude Opus 4.7: the quiet upgrade that changes the buying decision.

Same price as Opus 4.6. Better coding on the hard stuff. Triple the vision resolution. A new reasoning tier called xhigh. Here is what actually matters for developers and tech leaders — backed by the launch benchmarks and partner evals.

In one paragraph: Claude Opus 4.7 is Anthropic’s latest flagship, released April 16, 2026. It keeps Opus 4.6 pricing ($5 input / $25 output per million tokens) while posting double-digit gains on hard coding, long-horizon agent work, and visual tasks. Partners report 13% more coding tasks solved (GitHub), 3× more production tasks resolved (Rakuten), 98.5% vs 54.5% on visual acuity (XBOW), and a third of the tool errors on multi-step workflows (Notion). Sonnet 4.6 is still the everyday default. Opus 4.7 is the model you reach for when mistakes are expensive.

TL;DR · Five things to know

  • Price unchanged at $5/$25 per million tokens — a straight capability upgrade for teams already on Opus.
  • Hard coding is where it shines — 13% lift on GitHub’s 93-task benchmark, 3× on Rakuten SWE Bench, 70% vs 58% on CursorBench.
  • Vision jumped from 0.9 MP to 3.75 MP — dense dashboards, UI screenshots, and technical diagrams are now genuinely readable.
  • New xhigh effort tier plus task budgets and /ultrareview in Claude Code give teams real control over cost vs depth.
  • Watch token usage. The new tokenizer maps the same input to 1.0–1.35× as many tokens. Measure on real traffic before full migration.

Every Claude release arrives with the same question attached: is this a real step up, or a point-release marketing exercise? With Opus 4.7, the answer is unusually clear. Anthropic did not try to move every benchmark at once. They pushed hard on the places where Opus was already being trusted with expensive work — long agent runs, difficult debugging, dense document analysis — and made those places measurably better. The model that comes out the other side feels less like an incremental bump and more like a quiet reset of what “Opus-tier” means in 2026.

The launch is also unusually honest. Anthropic spells out that the new tokenizer will cost you more tokens per input, that the model thinks harder at high effort levels, and that prompts which previously worked by accident (because the model ignored messy instructions) may now behave differently. None of this is hidden. That alone tells you how Anthropic expects teams to evaluate the upgrade: carefully, on real traffic, with measurement. Let’s walk through what changed, what the numbers actually say, and what it means for developers and technology leaders planning their 2026 AI stack.

What is Claude Opus 4.7?

Short answer: Claude Opus 4.7 is Anthropic’s current generally available flagship model (released April 16, 2026), positioned for hard coding, long-running agentic tasks, and high-stakes analysis. It is priced identically to Opus 4.6 at $5 per million input tokens and $25 per million output tokens, available via Claude apps, the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

Opus 4.7 sits at the top of Anthropic’s generally available lineup — below the more constrained Mythos Preview (which Anthropic is deliberately gating for cyber safety reasons) and above Sonnet 4.6 and Haiku 4.5. The API model string is claude-opus-4-7. Both Opus 4.7 and Sonnet 4.6 support the 1 million token context window, so the choice between them is about capability per dollar and per second — not context length.
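
For orientation, here is a minimal sketch of calling the model through the Anthropic Python SDK. The model string is the one quoted above; the request shape is the standard Messages API, so treat this as a sketch and verify against the current API reference.

```python
# Minimal sketch: calling Opus 4.7 through the Anthropic Python SDK.
# The model string "claude-opus-4-7" comes from the launch post; everything
# else is the standard Messages API shape.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[
        {"role": "user", "content": "Explain the failure mode in this stack trace: ..."},
    ],
)
print(response.content[0].text)
```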

The positioning Anthropic chose is worth noting: Opus 4.7 is being pitched not as the fastest or cheapest model, but as the one that handles “complex, long-running tasks with rigor and consistency” and that “devises ways to verify its own outputs before reporting back.” That framing — a model that checks itself — is the thread that runs through every partner testimonial in the launch.

The coding benchmarks, unpacked

Short answer: On difficult, multi-step software engineering, Opus 4.7 is the strongest generally available Claude model to date. The most striking numbers: +13% over Opus 4.6 on GitHub’s 93-task coding benchmark, 3× the production tasks resolved on Rakuten SWE Bench, +14% on Notion’s multi-step workflow eval (with one-third the tool errors), and 70% vs 58% on CursorBench.

Opus 4.7 vs Opus 4.6 — coding & agent benchmarks (partner-reported deltas from Anthropic’s launch evaluations; higher is better):

  • GitHub 93-task coding benchmark: +13%
  • Rakuten SWE Bench (production tasks): 3× tasks resolved
  • CursorBench (agentic coding): 70% vs 58%
  • Notion workflows (multi-step tasks): +14%
  • Factory Droids (task success): +10–15%
  • CodeRabbit (bug-detection recall): +10%
  • Bolt app-building (long-running work): +10%

Source: partner testimonials, Anthropic launch post (Apr 16, 2026). Methodologies vary by partner.

Read the numbers together and a pattern appears. These are not parlor-trick gains on isolated puzzles — they are lifts on the kind of work developers actually ship: multi-step tasks, production fixes, code review, long-running builds. GitHub reported that Opus 4.7 solved four tasks that neither Opus 4.6 nor Sonnet 4.6 could solve at all. CodeRabbit said precision held steady while recall jumped over 10%, which is the hard combination — most models trade one for the other. Notion’s number is the one I’d circle: a third of the tool errors on multi-step workflows. That is the difference between an agent you trust to run overnight and one you babysit.

Two observations matter beyond the headline percentages. First, honesty improved. Hex specifically called out that Opus 4.7 correctly reports when data is missing, where Opus 4.6 would sometimes fabricate a plausible fallback. Vercel noted the model now does proofs on systems code before starting work — behavior no previous Claude had shown. Second, loop resistance improved. Genspark reported the highest quality-per-tool-call ratio they’ve measured, and flagged that earlier models would loop indefinitely on roughly 1 in 18 queries. That is the kind of failure mode that kills production deployments, and Opus 4.7 is meaningfully better at avoiding it.

Opus 4.7 is the reliability jump that makes agentic AI feel less like a demo and more like a teammate. The numbers on tool errors, honesty, and loop resistance are the ones to watch — not the raw benchmark scores. — The AI & Tech Society Editorial View

The vision upgrade nobody is talking about enough

Short answer: Opus 4.7 can now process images up to 2,576 pixels on the long edge (~3.75 megapixels) — more than 3× the resolution of previous Claude models. One early tester (XBOW) reported 98.5% on their visual acuity benchmark vs 54.5% for Opus 4.6. If your work involves dashboards, UI screenshots, technical diagrams, or chemical structures, this upgrade may matter more than the coding gains.

  • Max image resolution: ~3.75 MP (up from ~0.9 MP)
  • Visual acuity (XBOW eval): 98.5%, vs 54.5% on Opus 4.6
  • Pixels processed per image: more than 3× prior Claude models

The resolution jump looks like a spec-sheet item. It isn’t. Anything involving small on-screen text — a dense Datadog dashboard, a Figma export with labels at 10pt, a spreadsheet screenshot, a technical diagram with footnotes — sat in a grey zone with previous Claude models. You got roughly the right answer, but not reliably. At 3.75 megapixels, that grey zone shrinks dramatically. XBOW, which does autonomous penetration testing using computer-use agents, said their “single biggest Opus pain point effectively disappeared.” Solve Intelligence called out major gains on chemical structures. V0 (the design-interface company, not to be confused with an internal model name) said Opus 4.7 is now “the best model in the world for building dashboards and data-rich interfaces.”

For product teams, life sciences, design tooling, QA automation, and any workflow where a model looks at a screen and acts on what it sees, this is the sleeper upgrade of the release.
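
If you want to exercise the new resolution ceiling yourself, below is a sketch of sending a dashboard screenshot for reading. The image content block is the standard Anthropic Messages format; the 2,576-pixel long-edge limit is the figure quoted above, and the file name and prompt are illustrative.

```python
# Sketch: sending a dense dashboard screenshot to Opus 4.7 for reading.
# Uses the standard base64 image content block; keep the long edge at or
# under 2,576 px (the limit quoted in the launch post) to avoid downscaling.
import base64
import anthropic

def encode_png(path: str) -> str:
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png",
                        "data": encode_png("latency_dashboard.png")}},  # illustrative file
            {"type": "text",
             "text": "Read the p99 latency panel and list any series above 500 ms."},
        ],
    }],
)
print(response.content[0].text)
```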

Opus 4.7 vs Opus 4.6 vs Sonnet 4.6: which should you use?

Short answer: Use Sonnet 4.6 for high-volume, cost-sensitive everyday work. Use Opus 4.7 for hard coding, long agent runs, security and compliance review, finance and legal analysis — any task where the cost of a wrong answer outweighs the cost of tokens. Opus 4.6 has no remaining role for new projects; Opus 4.7 is a drop-in replacement at the same price.

Opus 4.7 vs Opus 4.6 vs Sonnet 4.6 at a glance:

  • Claude Opus 4.7: flagship for hard work · $5 input / $25 output per 1M tokens · 1M-token context · moderate latency · ~3.75 MP max image resolution · effort levels low/med/high/xhigh/max · best for hard debugging, agents, long refactors, security, legal/finance.
  • Claude Opus 4.6: previous flagship · $5 input / $25 output per 1M tokens · 1M-token context · moderate latency · ~0.9 MP max image resolution · effort levels low/med/high/max · superseded for new projects.
  • Claude Sonnet 4.6: speed & intelligence balance · $3 input / $15 output per 1M tokens · 1M-token context · fast · ~0.9 MP max image resolution · effort levels low/med/high/max · best for high-volume assistant work, lightweight coding, summarisation.

The buying logic has rarely been cleaner. If your team already runs on Opus 4.6, upgrading to 4.7 is the easiest decision of the quarter — same price, better on the work you’re paying Opus rates to do anyway. If your team is on Sonnet 4.6 and happy, there is no strong reason to switch unless you are hitting quality ceilings on hard, multi-step work. The category that should seriously reconsider is anyone currently running a mixed stack where Sonnet does triage and a different frontier model handles hard work — Opus 4.7’s honesty and loop-resistance gains make the “promote to Opus” path more attractive than it was a month ago.
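
That “promote to Opus” logic fits in a few lines of routing code. A minimal sketch under stated assumptions: the task fields and the Sonnet model string are illustrative placeholders, not published identifiers.

```python
# Illustrative routing sketch for a mixed stack: Sonnet for everyday volume,
# Opus 4.7 where a wrong answer costs more than the extra tokens.
# The task fields and "claude-sonnet-4-6" are assumptions for illustration.
def pick_model(task: dict) -> str:
    high_stakes = task.get("stakes") == "high" or task.get("multi_step", False)
    return "claude-opus-4-7" if high_stakes else "claude-sonnet-4-6"

print(pick_model({"stakes": "high"}))    # claude-opus-4-7
print(pick_model({"stakes": "low"}))     # claude-sonnet-4-6
print(pick_model({"multi_step": True}))  # claude-opus-4-7
```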

New features and workflow controls

Short answer: The model ships alongside four platform changes: a new xhigh effort level (between high and max), task budgets in public beta on the API for controlling long-run token spend, a /ultrareview slash command in Claude Code for dedicated code review sessions, and auto mode extended to Max users for fewer permission interruptions.

  • xhigh effort (new tier · API & Claude Code): a reasoning level between high and max, giving finer control over the quality/latency tradeoff. Now the default in Claude Code across all plans.
  • Task budgets (public beta · API): developers set token budgets per task, letting Claude prioritise work across longer agentic runs. Useful for controlling spend in overnight jobs.
  • /ultrareview (new command · Claude Code): a dedicated code-review session that reads changes end-to-end and flags bugs and design issues. Three free runs for Pro and Max users.
  • Auto mode for Max (expanded · Claude Code permissions): Claude makes routine permission decisions itself so long tasks aren’t interrupted. Safer than fully skipping permissions, and now available to Max plans.

Taken together, these are not cosmetic. The xhigh default in Claude Code is Anthropic’s way of saying: “for serious work, think more, by default.” Task budgets are the right answer to a real pain — agents that burn through a day’s budget in their first hour of exploration. And /ultrareview is the clearest product signal that Anthropic sees Opus 4.7 as a code-review-grade model, not just a code-generation model. That’s a meaningfully different claim.
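
The launch materials name these features but not their request fields, so any API example here is necessarily speculative. The sketch below passes assumed field names through the SDK’s extra_body escape hatch; treat both keys as placeholders until the API reference confirms them.

```python
# Sketch: requesting the new xhigh effort tier plus a per-task token budget.
# WARNING: "effort" and "task_budget_tokens" are ASSUMED field names, not
# confirmed by the launch post; they are forwarded untouched via extra_body.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8192,
    extra_body={
        "effort": "xhigh",              # assumed field name
        "task_budget_tokens": 200_000,  # assumed field name (public beta)
    },
    messages=[{"role": "user", "content": "Run the migration plan end-to-end."}],
)
```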

Pricing looks the same — but your bill might not

Short answer: List pricing is unchanged at $5 input / $25 output per million tokens, but two things can move your real spend: Opus 4.7 uses a new tokenizer that maps the same input to roughly 1.0–1.35× as many tokens depending on content, and the model thinks more at higher effort levels, producing more output tokens on hard tasks. Anthropic recommends measuring on production-like traffic before full migration.

Plan for this: two token-usage changes to watch.

1. New tokenizer. The same prompt can consume up to 35% more tokens than on Opus 4.6. Exact multiplier depends on content type (code, natural language, structured data all behave differently).

2. Deeper thinking at xhigh/max. Especially on later turns of agent runs, Opus 4.7 produces more reasoning tokens than Opus 4.6. Reliability goes up. Output bill goes up too.

Mitigations: the effort parameter, task budgets, and prompting for concision. Anthropic reports that on their internal coding eval, net token usage improved across all effort levels — but you should run your own measurement.
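
Measuring the tokenizer delta on your own traffic is a one-function job. The sketch below uses the standard token-counting endpoint; the Opus 4.6 model string is an assumption, so substitute whatever identifier your account uses for the older flagship.

```python
# Sketch: comparing input-token counts across the two tokenizers on real
# prompts. count_tokens is the standard Anthropic token-counting endpoint;
# "claude-opus-4-6" is an ASSUMED model string for the older flagship.
import anthropic

client = anthropic.Anthropic()

def token_ratio(messages: list[dict]) -> float:
    old = client.messages.count_tokens(model="claude-opus-4-6", messages=messages)
    new = client.messages.count_tokens(model="claude-opus-4-7", messages=messages)
    return new.input_tokens / old.input_tokens

sample = [{"role": "user", "content": open("prod_prompt_sample.txt").read()}]
print(f"4.7 / 4.6 input-token ratio: {token_ratio(sample):.2f}")  # expect ~1.0-1.35
```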

Implications for developers

Short answer: Retune your prompts (Opus 4.7 follows instructions more literally), start with xhigh effort for coding and agentic work, use task budgets on anything that runs longer than ~15 minutes, and point Opus 4.7 at your hardest bugs and refactors first. The failure modes that mattered most in production — fabricated fallbacks, silent loops, tool errors — are where the gains are biggest.

The prompt-tuning note is not optional. Anthropic was explicit: prompts that used to work because previous Claude models ignored messy or contradictory instructions may now behave unexpectedly, because Opus 4.7 takes instructions literally. This is a quality improvement, but it means your prompt library needs an audit. Cleaner system prompts, explicit definitions of done, and unambiguous success criteria will do more for your output quality on 4.7 than they did on 4.6.

For agent frameworks, the practical playbook is: start tasks with xhigh, not high; wire up task budgets on anything that iterates more than a handful of times; and instrument tool-error rates before and after upgrading, because this is where Opus 4.7’s gains are largest (Notion saw a 3× reduction, Factory saw similar). If your stack does its own verification or has a separate review step, try replacing it with /ultrareview for a week and see whether catch rates improve.
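
A starting point for that instrumentation is sketched below. run_agent_step and is_tool_error are hypothetical stand-ins for whatever your agent framework exposes; wire them to your own harness and run the same task set against both models.

```python
# Sketch: measuring tool-error rate on a fixed task set before and after
# the upgrade. The two callables are hypothetical hooks into your framework.
from collections import Counter
from typing import Callable, Iterable

def tool_error_rate(
    tasks: Iterable[str],
    model: str,
    run_agent_step: Callable[[str, str], list],  # (task, model) -> tool calls
    is_tool_error: Callable[[object], bool],
) -> float:
    stats = Counter()
    for task in tasks:
        for call in run_agent_step(task, model):
            stats["calls"] += 1
            stats["errors"] += int(is_tool_error(call))
    return stats["errors"] / max(stats["calls"], 1)

# Compare the same task set on both models before flipping the default:
# baseline = tool_error_rate(tasks, "claude-opus-4-6", run, is_err)  # assumed string
# upgraded = tool_error_rate(tasks, "claude-opus-4-7", run, is_err)
```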

Implications for CTOs and tech leaders

Short answer: The upgrade decision is easy; the policy decisions around it are not. Opus 4.7 is a straight capability upgrade at the same list price, but real spend can shift meaningfully, and more literal instruction-following changes prompt maintenance overhead. Treat the migration as a measurable project, not a flag flip.

Three questions worth putting on a CTO’s desk this month. First, where are we spending Opus budget today, and is the cost of a mistake there greater than the cost of the extra tokens? If yes, Opus 4.7 is the obvious call. If no — if you’re using Opus for high-volume triage work — the real answer might be moving that workload to Sonnet 4.6, not upgrading Opus. Second, what is our tool-error rate and loop rate on production agents? These are the places Opus 4.7 improved most, and the gains translate into fewer on-call incidents, not just better benchmarks. Third, what is our prompt library’s maintenance posture? Teams that treat system prompts as disposable scripts will feel pain on this upgrade. Teams that version-control and test them will see gains immediately.

There’s also a narrative point worth flagging for leadership communication. Opus 4.7 is deliberately not Anthropic’s most capable model — that’s Mythos Preview, which Anthropic is keeping limited because of cybersecurity concerns flagged under Project Glasswing. Opus 4.7 shipped with new cyber safeguards as a trial run before broader Mythos release. For regulated industries, this context matters: Anthropic is demonstrating that it can gate capability for safety reasons and still ship a commercially useful flagship. That pattern, not the benchmark bars, is the signal worth watching.

The upgrade decision for Opus 4.7 is easy. The harder question is whether your workloads are on the right Claude model in the first place — and the answer is increasingly: audit, don’t assume. — The AI & Tech Society Editorial View

Is Claude Opus 4.7 worth upgrading to? (Final take)

Short answer: Yes for any team currently running Opus 4.6 on hard coding, long agent runs, or vision-heavy workflows — it’s a same-price capability lift with the clearest gains on the failure modes that hurt most in production. Maybe not for teams on Sonnet 4.6 doing everyday work, where Sonnet’s speed and price still win. Either way, measure token usage on real traffic before full rollout.

Opus 4.7 is not a revolution. It is something rarer in the current LLM cycle: a release that made the model better at the things buyers were already paying for, without trying to reframe the market around a new capability axis. Coding on hard tasks: better. Long agent runs: more reliable. Vision: meaningfully sharper. Honesty: improved. Price: untouched. The new platform features — xhigh, task budgets, /ultrareview, expanded auto mode — address the exact pain points teams have been flagging all year.

The simple summary I keep coming back to: Sonnet is still the everyday driver. Opus 4.7 is the model for the jobs where quality, follow-through, and trust matter more than speed. If your team lives in that second category often enough — and a growing number do — this release is important. Not because it changes what AI can do, but because it raises the floor on what you can reliably hand off.

Frequently asked questions

Short answer: Quick answers to the most common questions about Claude Opus 4.7 — release date, pricing, comparison with other models, and migration guidance.
When was Claude Opus 4.7 released?
Claude Opus 4.7 became generally available on April 16, 2026, across Claude apps, the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. The API model string is claude-opus-4-7.
How much does Claude Opus 4.7 cost?
Pricing is unchanged from Opus 4.6: $5 per million input tokens and $25 per million output tokens. Note that a new tokenizer means the same input may consume up to 1.35× as many tokens, so real spend can shift despite identical list prices.
Is Claude Opus 4.7 better than GPT-5.4 or Gemini 3.1 Pro for coding?
Anthropic’s published charts compare Opus 4.7 against the best reported API versions of GPT-5.4 and Gemini 3.1 Pro across coding, office tasks, vision, document reasoning, and long-context evaluations. Opus 4.7 posts leading or competitive scores across most categories, with particular strength on agentic and long-horizon coding. Comparative results vary by benchmark — teams should run their own evals on production-representative tasks.
Should I switch from Sonnet 4.6 to Opus 4.7?
Only if your workload is bottlenecked by quality rather than speed or cost. Sonnet 4.6 remains the right default for high-volume, lower-complexity work at $3/$15 per million tokens. Opus 4.7 is for hard coding, long agent runs, security review, and high-stakes analysis where a wrong answer costs more than the extra tokens.
What is the xhigh effort level and when should I use it?
xhigh is a new reasoning tier between high and max, introduced with Opus 4.7. Anthropic recommends starting with high or xhigh for coding and agentic use cases. In Claude Code, xhigh is the default across all plans. Use max only for the hardest problems where the additional thinking time is justified.
Can Claude Opus 4.7 run autonomously for long periods?
Yes — and this is one of its largest gains over Opus 4.6. Devin’s team reported Opus 4.7 “works coherently for hours” on long-horizon tasks, while Genspark highlighted loop resistance and graceful error recovery as its strongest production differentiators. Task budgets (new in this release) give developers explicit control over token spend across long runs.
Does Claude Opus 4.7 support image input?
Yes, with a significant resolution upgrade. Opus 4.7 accepts images up to 2,576 pixels on the long edge (approximately 3.75 megapixels) — more than three times the resolution of earlier Claude models. This makes it meaningfully better for dense dashboards, UI screenshots, technical diagrams, and chemical structures.
