GPT-5.5: OpenAI’s agentic reset — and a 2× price tag to match.
The first fully retrained base model since GPT-4.5. A 1 million token context window. State-of-the-art on agentic coding. Doubled API pricing. Here is what actually matters for developers and tech leaders — backed by the launch benchmarks and early partner data.
In one paragraph: GPT-5.5 is OpenAI’s new flagship, released April 23, 2026. It is the first fully retrained base model in the GPT-5.x series, designed from the ground up for agentic work — tool use, computer control, long-running tasks. It scores 82.7% on Terminal-Bench 2.0 (13 points ahead of Claude Opus 4.7), 84.9% on GDPval, and 78.7% on OSWorld-Verified, while matching GPT-5.4’s per-token latency. API pricing doubles to $5 / $30 per million tokens, but ~40% lower token usage softens the real cost increase. GPT-5.5 Pro, a higher-accuracy variant, is available in ChatGPT for Pro, Business, and Enterprise users. Rolling out now to paid tiers in ChatGPT and Codex; API access coming soon.
TL;DR · Six things to know
- First fully retrained base model since GPT-4.5. Every 5.x release between them was a post-training iteration — GPT-5.5 is not.
- Built for agents. Terminal-Bench 2.0: 82.7% (+7.6 pts vs GPT-5.4). OSWorld: 78.7%. GDPval: 84.9% across 44 occupations.
- 1 million token context window in the API — OpenAI’s first API model with that context size.
- API pricing doubles to $5 / $30 per million tokens. Token efficiency (~40% fewer output tokens) softens the blow to roughly +20% net.
- GPT-5.5 Pro variant leads on BrowseComp (90.1%) and FrontierMath Tier 1-3 (52.4%) — for Pro/Business/Enterprise in ChatGPT only.
- Watch the hallucination trade-off. Highest accuracy recorded (57% on AA-Omniscience) but 86% hallucination rate when wrong.
A week after Anthropic shipped Claude Opus 4.7, OpenAI answered. GPT-5.5 landed on April 23 with an unusually direct message: the previous pace of incremental point releases is over, this is a new base model, and it is designed for a different job than GPT-5.4 was. Where GPT-5.x was framed as a unified reasoning system that routes questions through a chain of thought, GPT-5.5 is framed as an agent — something that “takes a sequence of actions, uses tools, checks its own work, and keeps going until a task is finished.” That framing is not marketing. It shows up in what OpenAI chose to benchmark, what they chose not to, and where the price sits.
The release is also the clearest competitive shot of the year so far. OpenAI’s own comparison charts line GPT-5.5 up against Claude Opus 4.7 and Gemini 3.1 Pro on roughly twenty benchmarks. On agentic and computer-use evaluations the lead is clean. On some others — notably SWE-Bench Pro and raw hallucination discipline — the competitors hold their ground. And the pricing change is the boldest move OpenAI has made in the 5.x series: doubled per-token input and output rates, offset by claimed token efficiency gains. Let’s unpack what it actually means.
What is GPT-5.5?
Two things separate this release from the four point releases that preceded it. First, architectural: GPT-5.1, 5.2, 5.3, and 5.4 were all post-training iterations on the same GPT-5 base. GPT-5.5 is not. OpenAI retrained the base model end-to-end with agent-oriented objectives baked into pretraining rather than bolted on in fine-tuning. Second, positioning: every previous 5.x release was pitched as a general-purpose upgrade. GPT-5.5 is pitched specifically as the model that lets you hand off multi-step work — writing code, browsing the web, operating software, filling spreadsheets, debugging — without re-prompting at every handoff.
In practice this shows up as a split product line. In ChatGPT, the default “Thinking” variant is now GPT-5.5 — it replaces GPT-5.4 outright. Above that sits GPT-5.5 Pro, available to Pro, Business, and Enterprise users, positioned as a higher-accuracy iterative research partner for the hardest work. And OpenAI has added explicit effort controls — non-reasoning, low, medium, high, and xhigh — creating a flexible cost/quality profile across a single model family.
The benchmarks, and who actually wins
Three observations from the benchmark charts. One: the agentic lead is real, not marginal. A 13-point gap on Terminal-Bench 2.0 is not within measurement noise — it reflects OpenAI’s decision to retrain the base model specifically for tool use and computer operation. Two: coding is split. GPT-5.5 wins on agentic command-line work where planning and tool coordination dominate, but Claude Opus 4.7 still wins on SWE-Bench Pro, which most closely mirrors typical engineering tasks. If your definition of “coding” is closer to “write this function” than “run this multi-step migration,” the Opus advantage holds. Three: some of these numbers come from OpenAI’s own grid, meaning the competitor scores were not independently verified by Anthropic or Google. Treat the ranking as directional; run your own eval before betting production on it.
The agentic story in four numbers
- Terminal-Bench 2.0: 82.7% (agentic CLI work)
- OSWorld-Verified: 78.7% (computer operation)
- Tau2-bench Telecom: 98.0% (no prompt tuning)
- GDPval: 84.9% (across 44 occupations)
The 98.0% on Tau2-bench Telecom is the one that deserves a second look. Customer-service workflow benchmarks usually reward prompt engineering heavily — a well-tuned harness can squeeze 10+ points out of a mediocre model. OpenAI explicitly notes this score was achieved without prompt tuning, with GPT-4.1 serving as the user-simulator model. That is the difference between a demo and a deployable product. Similarly, GeneBench and BixBench — multi-day scientific data analysis tasks — are the kind of evaluation where most models fail not because they can’t reason but because they lose the thread halfway through. GPT-5.5’s improvement there is a coherence claim, not an intelligence claim.
GPT-5.5 Pro: what it is and when to use it
The Pro variant is best understood as an “iterative research partner” rather than a faster chatbot. Where standard GPT-5.5 Thinking is optimized to handle a full agentic task end-to-end at reasonable cost and latency, GPT-5.5 Pro is tuned for the handful of tasks per week where an additional 2-5 percentage points of accuracy justifies substantially more compute. Legal analysis of complex contracts, deep web research with citations that actually need to be correct, mathematical proofs, and scientific literature synthesis are the natural fit. Everyday agentic coding is not.
An interesting note from the launch: an internal variant of GPT-5.5 (with a customized harness) reportedly found a new proof relating to Ramsey numbers in combinatorics, subsequently verified in Lean. That’s the sort of result that matters less as a product signal and more as a capability signal — research-grade math is no longer obviously out of reach for a commercial frontier model.
The pricing change, honestly
1. Per-token price doubled. The largest single-release price jump in the GPT-5.x series. On raw tokens, a GPT-5.4 workload moved to 5.5 without any changes will cost roughly twice as much.
2. Token efficiency partially offsets it. OpenAI reports ~40% fewer output tokens on comparable Codex tasks, bringing real cost increase closer to +20%. But this varies wildly by workload — agentic coding sees larger gains than pure Q&A.
3. At medium effort, GPT-5.5 reportedly matches Claude Opus 4.7 at roughly a quarter of Opus 4.7’s inference cost in some workloads. The effort dial matters more than the list price.
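To make that arithmetic concrete, here is a back-of-envelope sketch of the net-cost claim. Only the new $5 / $30 rates come from the launch; the GPT-5.4 rates (assumed to be half the new ones, $2.50 / $15.00) and the sample token mix are illustrative assumptions, and the net figure shifts with your input/output ratio.

```python
# Back-of-envelope check of the "net cost" claim from the launch numbers.
# Assumption: GPT-5.4 rates were half the new ones; the ~40% token
# reduction is applied to output tokens on an output-heavy agentic task.

OLD_IN, OLD_OUT = 2.50, 15.00   # $ per 1M tokens (assumed GPT-5.4 rates)
NEW_IN, NEW_OUT = 5.00, 30.00   # $ per 1M tokens (GPT-5.5 launch rates)

def task_cost(price_in, price_out, tokens_in, tokens_out):
    """Dollar cost of one task; token counts are in millions."""
    return price_in * tokens_in + price_out * tokens_out

# Hypothetical output-heavy task: 0.05M input, 0.2M output on GPT-5.4.
old = task_cost(OLD_IN, OLD_OUT, 0.05, 0.2)
# Same task on GPT-5.5: identical input, ~40% fewer output tokens.
new = task_cost(NEW_IN, NEW_OUT, 0.05, 0.2 * 0.6)

print(f"GPT-5.4: ${old:.2f}  GPT-5.5: ${new:.2f}  change: {new/old - 1:+.0%}")
# → GPT-5.4: $3.12  GPT-5.5: $3.85  change: +23%
```

Note that the more input-heavy the workload, the closer the net change creeps toward the raw 2×, since input tokens see no efficiency offset in this sketch.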
GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro: how to choose
| | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|
| Positioning | Agentic work & computer use | Hard coding & long runs | Multimodal & Google stack |
| Input / 1M tokens | $5.00 | $5.00 | ~$1.25–$5.00 |
| Output / 1M tokens | $30.00 | $25.00 | ~$10–$15 |
| Context window | 1M tokens | 1M tokens | 1M+ tokens |
| Terminal-Bench 2.0 | 82.7% | 69.4% | 68.5% |
| SWE-Bench Pro | 58.6% | 64.3% | — |
| Hallucination rate* | 86% | 36% | 50% |
| Best for | Agents, computer use, research, finance modeling | Production coding, code review, regulated industries | Multimodal tasks, Google Cloud integration, cost-sensitive |
*AA-Omniscience hallucination rate when the model gives an incorrect answer. Lower is better.
The practical shape of the decision: if your workload has a clear “done” state and multiple tool calls in between (research with citations, spreadsheet builds, data pipelines, computer-use agents), GPT-5.5 is the strongest option on the market today. If your workload is concentrated in code that ships to production (pull requests, bug fixes, refactors where one wrong line costs hours), Opus 4.7’s SWE-Bench Pro lead and much lower hallucination rate matter more than GPT-5.5’s agentic gains. And if you’re running a lot of volume, neither Opus 4.7 nor GPT-5.5 is the right default — that’s still Sonnet 4.6’s or Gemini’s ground.
The hallucination trade-off nobody should ignore
GPT-5.5 is simultaneously more accurate on knowledge-heavy tasks and more confidently wrong when it fails. The combination is a specific failure mode: a model that is right often enough that users stop checking, and wrong confidently enough that the errors are hard to catch. For any deployment where accuracy matters — legal, finance, medical, compliance — this changes the verification stack, not just the model choice.
OpenAI has not hidden this. The model is classified as “High” on cybersecurity capabilities under its preparedness framework — below “Critical” but above every previous public release. Roughly 200 early-access partner organizations tested the model for approximately eight weeks before general availability, and the company describes this as its strongest safety deployment to date. These are reasonable moves, but they don’t eliminate the calibration problem. Teams building on GPT-5.5 should assume verification layers (retrieval-augmented answers, citation-required outputs, second-model review) are not optional for regulated use.
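As one concrete example of a citation-required layer, the sketch below gates an answer on whether its citations resolve against retrieved sources. The bracket-citation format, the `sources` shape, and the sample strings are all hypothetical illustrations, not any real framework’s API.

```python
# Minimal citation-required gate: reject any answer that cites nothing,
# or that cites a source id we never actually retrieved.
import re

def citations_resolve(answer: str, sources: dict[str, str]) -> bool:
    """Require at least one [id] citation and require every id to exist."""
    cited = re.findall(r"\[(\w+)\]", answer)
    return bool(cited) and all(c in sources for c in cited)

# Hypothetical retrieved sources, keyed by citation id.
sources = {"s1": "Q3 revenue was $4.2M.", "s2": "Headcount grew 12%."}

assert citations_resolve("Revenue was $4.2M [s1].", sources)
assert not citations_resolve("Revenue was $9M.", sources)       # no citation
assert not citations_resolve("Revenue was $9M [s7].", sources)  # unknown id
```

A gate like this does not prove the claim is true — it only forces every claim to be traceable, which is what makes the confident-when-wrong failure mode catchable downstream.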
Implications for developers
Three concrete recommendations for engineering teams. First, rewrite agent prompts, don’t port them. GPT-5.5’s agentic improvements come from the base model being trained for long-horizon tool use. Prompts written for GPT-5.4 that wrap the model in heavy scaffolding, step-by-step chains of thought, and defensive retries are likely over-engineered for 5.5 — and the extra tokens will show up on the bill. Start minimal and add scaffolding only where the benchmarks on your workload demand it.
Second, use the effort dial as the primary cost knob. Five effort levels (non-reasoning through xhigh) mean a single model can cover the range from cheap triage to hard reasoning without switching model IDs. Medium is the surprising sweet spot — reported to match Claude Opus 4.7 performance on many workloads at significantly lower cost. Default to medium, promote to high for the tasks that fail at medium, and reserve xhigh for code review and research.
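A minimal version of that default-then-promote pattern might look like the sketch below. The GPT-5.5 API is not yet public, so `run_task`, the effort names as plain strings, and the success check are all placeholder assumptions — this is the routing pattern, not real API calls.

```python
# Escalation loop for the effort dial: start at medium and promote only
# the tasks that fail your own verification check.
EFFORT_LADDER = ["medium", "high", "xhigh"]

def run_task(prompt: str, effort: str) -> dict:
    """Placeholder for a real model call; here it 'fails' at medium."""
    return {"ok": effort != "medium", "answer": f"[{effort}] result"}

def solve_with_escalation(prompt: str, check=lambda r: r["ok"]) -> dict:
    """Walk up the ladder until the result passes verification."""
    for effort in EFFORT_LADDER:
        result = run_task(prompt, effort)
        if check(result):           # your success check, not the model's
            result["effort_used"] = effort
            return result
    result["effort_used"] = EFFORT_LADDER[-1]
    return result                   # best attempt after exhausting the ladder

print(solve_with_escalation("migrate the billing table")["effort_used"])
# → high
```

The point of the pattern is that the expensive tiers are only paid for on the minority of tasks that actually need them, which is what makes medium a viable default.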
Third, instrument from day one. The pricing jump plus the token efficiency offset plus workload variance means any post-hoc cost modeling will be wrong. Log input tokens, output tokens, and task success rates in parallel with GPT-5.4 for at least two weeks on a representative slice of traffic before committing. This matters especially for any team that already chose Opus 4.7 recently — the comparison is now live, and you have real production data to decide on.
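As a sketch of what that instrumentation might look like, the snippet below logs per-task token counts and outcomes for both models and compares cost per successful task rather than cost per token. The field names and sample numbers are illustrative, and the GPT-5.4 rates are an assumption (half the published GPT-5.5 rates).

```python
# Side-by-side cost logging: the metric that matters is dollars per
# *successful* task, so failed runs count against a model's total spend.
from dataclasses import dataclass

PRICE = {  # $ per 1M tokens (input, output); GPT-5.4 rates are assumed
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.5": (5.00, 30.00),
}

@dataclass
class TaskLog:
    model: str
    tokens_in: int      # raw token counts, not millions
    tokens_out: int
    success: bool

def cost_per_success(logs: list[TaskLog], model: str) -> float:
    rows = [r for r in logs if r.model == model]
    p_in, p_out = PRICE[model]
    spend = sum((r.tokens_in * p_in + r.tokens_out * p_out) / 1e6 for r in rows)
    wins = sum(r.success for r in rows)
    return spend / wins if wins else float("inf")

logs = [  # hypothetical parallel traffic on the same task slice
    TaskLog("gpt-5.4", 200_000, 100_000, True),
    TaskLog("gpt-5.4", 200_000, 100_000, False),  # rework counts against 5.4
    TaskLog("gpt-5.5", 200_000, 60_000, True),
    TaskLog("gpt-5.5", 200_000, 60_000, True),
]
print({m: round(cost_per_success(logs, m), 2) for m in PRICE})
# → {'gpt-5.4': 4.0, 'gpt-5.5': 2.8}
```

In this contrived sample the pricier model wins on completion cost because it fails less — which is exactly the kind of result you can only see if success rates are logged alongside tokens.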
Implications for CTOs and tech leaders
The portfolio question is the most important strategic shift of the quarter. For the first time, three generally available frontier models each have distinct, defensible strengths: GPT-5.5 on agents and computer use, Opus 4.7 on production coding and low hallucination rates, Gemini 3.1 Pro on multimodal and Google-native workflows. The companies shipping the best AI-powered products in Q3 2026 will almost certainly be the ones running all three in different parts of their stack — not the ones that picked a winner.
The verification question is harder and more urgent. GPT-5.5’s accuracy/hallucination profile is a product design issue, not just an eng issue. If your product answers user questions with AI-generated content, your verification UI, your citation discipline, your confidence-scoring, and your fallback behavior all need a review before you move production traffic. OpenAI’s classification of GPT-5.5 as “High” on the preparedness framework is a real signal — not one that should block adoption, but one that should shape how the model is exposed to end users.
Finally, the pricing question is about organizational posture. Teams that optimize primarily for per-token cost will struggle with GPT-5.5; the model is priced as a frontier product, not a commodity. Teams that optimize for task completion cost (tokens × tasks × rework) may find the math improves, especially on agentic workloads where GPT-5.5’s completion rates are meaningfully higher. Which framing your organization uses is largely a function of where AI sits in the budget structure — the models didn’t change, the strategic question did.
Is GPT-5.5 worth it? (Final take)
GPT-5.5 is the most confident release OpenAI has shipped this year, and also the most contested. It is clearly state of the art on the benchmarks it was built for — agentic workflows, computer use, long-horizon knowledge work — and it is clearly not state of the art on some benchmarks it wasn’t, most notably production coding and hallucination discipline. The honest reading is that the frontier has fragmented. There is no single “best” model now; there are a handful of models that each lead on specific, meaningful workloads.
The simplest summary I can offer: GPT-5.5 is the model for teams whose product is “an AI does the work.” Claude Opus 4.7 is the model for teams whose product is “an AI helps my engineers ship code.” Sonnet 4.6 and Gemini 3.1 Pro are still the right defaults for everything else. If your team has an agentic product on the roadmap for 2026 — a research assistant, a computer-use agent, a multi-step automation — GPT-5.5 is the release to evaluate this month, not next quarter. If your team is primarily shipping code, the upgrade is less urgent. Either way, the age of single-model AI stacks is functionally over.
Further reading
- OpenAI’s official announcement: Introducing GPT-5.5
- Our earlier analysis: Claude Opus 4.7 Review — The Quiet Upgrade That Changes the Buying Decision
- Artificial Analysis Intelligence Index for live cross-vendor benchmarks
- GPT-5.5 System Card for full safety and preparedness framework evaluations