GPT-5.5 Review: Benchmarks, Pricing & What It Means for Developers | The AI and Tech Society
digitalstrategy-ai.com · Vol. 04 · Issue 17
Model Review · OpenAI · April 2026

GPT-5.5: OpenAI’s agentic reset — and a 2× price tag to match.

The first fully retrained base model since GPT-4.5. A 1 million token context window. State-of-the-art on agentic coding. Doubled API pricing. Here is what actually matters for developers and tech leaders — backed by the launch benchmarks and early partner data.

In one paragraph: GPT-5.5 is OpenAI’s new flagship, released April 23, 2026. It is the first fully retrained base model in the GPT-5.x series, designed from the ground up for agentic work — tool use, computer control, long-running tasks. It scores 82.7% on Terminal-Bench 2.0 (13 points ahead of Claude Opus 4.7), 84.9% on GDPval, and 78.7% on OSWorld-Verified, while matching GPT-5.4’s per-token latency. API pricing doubles to $5 / $30 per million tokens, but ~40% lower token usage softens the real cost increase. GPT-5.5 Pro, a higher-accuracy variant, is available in ChatGPT for Pro, Business, and Enterprise users. Rolling out now to paid tiers in ChatGPT and Codex; API access coming soon.

TL;DR · Six things to know

  • First fully retrained base model since GPT-4.5. Every 5.x release between them was a post-training iteration — GPT-5.5 is not.
  • Built for agents. Terminal-Bench 2.0: 82.7% (+7.6 pts vs GPT-5.4). OSWorld: 78.7%. GDPval: 84.9% across 44 occupations.
  • 1 million token context window in the API — OpenAI’s first API model with that context size.
  • API pricing doubles to $5 / $30 per million tokens. Token efficiency (~40% fewer output tokens) softens the blow to roughly +20% net.
  • GPT-5.5 Pro variant leads on BrowseComp (90.1%) and FrontierMath Tier 1-3 (52.4%) — for Pro/Business/Enterprise in ChatGPT only.
  • Watch the hallucination trade-off. Highest accuracy recorded (57% on AA-Omniscience) but 86% hallucination rate when wrong.

A week after Anthropic shipped Claude Opus 4.7, OpenAI answered. GPT-5.5 landed on April 23 with an unusually direct message: the previous pace of incremental point releases is over, this is a new base model, and it is designed for a different job than GPT-5.4 was. Where GPT-5.x was framed as a unified reasoning system that routes questions through a chain of thought, GPT-5.5 is framed as an agent — something that “takes a sequence of actions, uses tools, checks its own work, and keeps going until a task is finished.” That framing is not marketing. It shows up in what OpenAI chose to benchmark, what they chose not to, and where the price sits.

The release is also the clearest competitive shot of the year so far. OpenAI’s own comparison charts line GPT-5.5 up against Claude Opus 4.7 and Gemini 3.1 Pro on roughly twenty benchmarks. On agentic and computer-use evaluations the lead is clean. On some others — notably SWE-Bench Pro and raw hallucination discipline — the competitors hold their ground. And the pricing change is the boldest move OpenAI has made in the 5.x series: doubled per-token input and output rates, offset by claimed token efficiency gains. Let’s unpack what it actually means.

What is GPT-5.5?

Short answer: GPT-5.5 is OpenAI’s latest flagship model, released April 23, 2026. It is the first fully retrained base model since GPT-4.5 — meaning the architecture, pretraining, and agentic objectives were all reworked rather than iterated on. It ships with a 1 million token context window (OpenAI’s first API model at that size), matches GPT-5.4’s per-token latency, and uses roughly 40% fewer output tokens on comparable Codex tasks. It is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with API access “coming soon.”

Two things separate this release from the four point releases that preceded it. First, architectural: GPT-5.1, 5.2, 5.3, and 5.4 were all post-training iterations on the same GPT-5 base. GPT-5.5 is not. OpenAI retrained the base model end-to-end with agent-oriented objectives baked into pretraining rather than bolted on in fine-tuning. Second, positioning: every previous 5.x release was pitched as a general-purpose upgrade. GPT-5.5 is pitched specifically as the model that lets you hand off multi-step work — writing code, browsing the web, operating software, filling spreadsheets, debugging — without re-prompting at every handoff.

In practice this shows up as a split product line. In ChatGPT, the default “Thinking” variant is now GPT-5.5 — it replaces GPT-5.4 outright. Above that sits GPT-5.5 Pro, available to Pro, Business, and Enterprise users, positioned as a higher-accuracy iterative research partner for the hardest work. And OpenAI has added explicit effort controls — non-reasoning, low, medium, high, and xhigh — creating a flexible cost/quality profile across a single model family.

The benchmarks, and who actually wins

Short answer: On agentic and computer-use benchmarks, GPT-5.5 is the clear state of the art: 82.7% on Terminal-Bench 2.0 vs 69.4% for Claude Opus 4.7, 78.7% on OSWorld-Verified, 98.0% on Tau2-bench Telecom, and 84.9% on GDPval. On SWE-Bench Pro and hallucination rates, Claude Opus 4.7 still leads. GPT-5.5 at xhigh effort scores 60 on the Artificial Analysis Intelligence Index — 3 points ahead of the prior three-way tie at the top.
GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro
Benchmark comparison across agentic, knowledge work, and coding evaluations (higher = better):

  • Terminal-Bench 2.0 (agentic command-line workflows): GPT-5.5 82.7% · Opus 4.7 69.4% · Gemini 3.1 68.5%
  • GDPval (knowledge work across 44 occupations): GPT-5.5 84.9% · Opus 4.7 ~80%
  • OSWorld-Verified (operating real computer environments): GPT-5.5 78.7%
  • Tau2-bench Telecom (customer-service workflows, no prompt tuning): GPT-5.5 98.0%
  • SWE-Bench Pro (real-world software engineering; Opus 4.7 wins): Opus 4.7 64.3% · GPT-5.5 58.6%
  • FinanceAgent (finance analyst workflows): GPT-5.5 60.0%
  • Internal IB Modeling (investment-banking spreadsheet modeling): GPT-5.5 88.5%
Sources: OpenAI announcement (Apr 23, 2026), third-party aggregations. Cross-vendor scores in OpenAI’s grid are not independently verified by Anthropic or Google.

Three observations from the bars. One: the agentic lead is real, not marginal. A 13-point gap on Terminal-Bench 2.0 is not within measurement noise — it reflects OpenAI’s decision to retrain the base model specifically for tool use and computer operation. Two: coding is split. GPT-5.5 wins on agentic command-line work where planning and tool coordination dominate, but Claude Opus 4.7 still wins on SWE-Bench Pro, which most closely mirrors typical engineering tasks. If your definition of “coding” is closer to “write this function” than “run this multi-step migration,” the Opus advantage holds. Three: some of these numbers come from OpenAI’s own grid, meaning the competitor scores were not independently verified by Anthropic or Google. Treat the ranking as directional; run your own eval before betting production on it.

GPT-5.5 is not trying to win every benchmark. It is trying to be the model you run when you want to hand over a whole task — not a prompt — and come back later. That is a different claim than “better than Opus at code.” — The AI & Tech Society Editorial View

The agentic story in four numbers

Short answer: GPT-5.5’s defining claim is long-horizon autonomy: 78.7% on OSWorld-Verified (computer operation), 98.0% on Tau2-bench Telecom (customer workflows) without prompt tuning, 84.9% on GDPval (knowledge work across 44 occupations), and leading performance on GeneBench (multi-day scientific research tasks). These are the numbers that matter if your use case is “the model does the work while I sleep.”
  • Terminal-Bench 2.0 (agentic CLI work): 82.7%
  • OSWorld-Verified (computer operation): 78.7%
  • Tau2 Telecom (no prompt tuning): 98.0%
  • GDPval (44 occupations): 84.9%

The 98.0% on Tau2-bench Telecom is the one that deserves a second look. Customer-service workflow benchmarks usually reward prompt engineering heavily — a well-tuned harness can squeeze 10+ points out of a mediocre model. OpenAI explicitly notes this score was achieved without prompt tuning, with GPT-4.1 serving as the user-simulator model. That is the difference between a demo and a deployable product. Similarly, GeneBench and BixBench — multi-day scientific data analysis tasks — are the kind of evaluation where most models fail not because they can’t reason but because they lose the thread halfway through. GPT-5.5’s improvement there is a coherence claim, not an intelligence claim.

GPT-5.5 Pro: what it is and when to use it

Short answer: GPT-5.5 Pro is a higher-accuracy, higher-latency variant available in ChatGPT for Pro, Business, and Enterprise users. It leads on BrowseComp (90.1%) and FrontierMath Tier 1–3 (52.4%), making it the strongest OpenAI variant for deep web research, advanced mathematics, and iterative scientific reasoning. Expect roughly 6× the inference cost of standard GPT-5.5 for a few extra percentage points of reliability — worth it on high-stakes, low-volume work.

The Pro variant is best understood as an “iterative research partner” rather than a faster chatbot. Where standard GPT-5.5 Thinking is optimized to handle a full agentic task end-to-end at reasonable cost and latency, GPT-5.5 Pro is tuned for the handful of tasks per week where an additional 2-5 percentage points of accuracy justifies substantially more compute. Legal analysis of complex contracts, deep web research with citations that actually need to be correct, mathematical proofs, and scientific literature synthesis are the natural fit. Everyday agentic coding is not.
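Whether the Pro premium pays off reduces to simple expected-value arithmetic. A minimal sketch, assuming the roughly 6× cost multiple and the 2–5 point accuracy delta discussed above; `error_cost` stands in for whatever a wrong answer costs your team (review time, rework, liability):

```python
def pro_is_worth_it(base_cost, accuracy_gain, error_cost, cost_multiple=6.0):
    """True when the expected savings from fewer wrong answers exceed the
    extra inference spend. accuracy_gain is a fraction (0.03 = 3 points)."""
    extra_spend = base_cost * (cost_multiple - 1)   # added cost of Pro per task
    expected_savings = accuracy_gain * error_cost   # value of the extra accuracy
    return expected_savings > extra_spend

# A $0.50 standard-model task where a wrong answer costs an hour of review (~$100):
print(pro_is_worth_it(0.50, 0.03, 100.0))   # True (3.00 > 2.50)
# The same accuracy gain on high-volume triage where errors are cheap:
print(pro_is_worth_it(0.50, 0.03, 20.0))    # False
```

The shape of the answer matches the guidance above: Pro wins on high-stakes, low-volume work and loses everywhere else.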

An interesting note from the launch: an internal variant of GPT-5.5 (with a customized harness) reportedly found a new proof relating to Ramsey numbers in combinatorics, subsequently verified in Lean. That’s the sort of result that matters less as a product signal and more as a capability signal — research-grade math is no longer obviously out of reach for a commercial frontier model.

The pricing change, honestly

Short answer: GPT-5.5 API pricing is $5 per million input tokens and $30 per million output tokens — double GPT-5.4’s $2.50 / $15 rate. OpenAI’s argument: roughly 40% fewer output tokens on comparable tasks keeps the real cost increase closer to +20% net. Whether that math holds depends entirely on your workload. Measure before migrating.
Previous generation (GPT-5.4):
  • Input / 1M tokens: $2.50
  • Output / 1M tokens: $15.00
  • Context window: 400K

New, coming to API (GPT-5.5):
  • Input / 1M tokens: $5.00
  • Output / 1M tokens: $30.00
  • Context window: 1M
Plan for this
Three pricing realities to weigh before migrating:

1. Per-token price doubled. The largest single-release price jump in the GPT-5.x series. On raw tokens, a GPT-5.4 workload moved to 5.5 without any changes will cost roughly 2× more.

2. Token efficiency partially offsets it. OpenAI reports ~40% fewer output tokens on comparable Codex tasks, bringing real cost increase closer to +20%. But this varies wildly by workload — agentic coding sees larger gains than pure Q&A.

3. At medium effort, GPT-5.5 reportedly matches Claude Opus 4.7 at roughly a quarter of Opus 4.7’s inference cost in some workloads. The effort dial matters more than the list price.
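The offset math is easy to check against your own traffic mix. A minimal sketch using the published list prices; the 0.6 output multiplier encodes OpenAI's claimed ~40% output-token reduction, which you should replace with your own measured ratio before trusting any projection:

```python
# Per-million-token list prices from the launch announcement.
GPT54 = {"input": 2.50, "output": 15.00}
GPT55 = {"input": 5.00, "output": 30.00}

def cost(prices, tokens_in, tokens_out):
    """Dollar cost of one workload at the given per-1M-token rates."""
    return (tokens_in * prices["input"] + tokens_out * prices["output"]) / 1_000_000

def migration_delta(tokens_in, tokens_out, output_efficiency=0.6):
    """Percent cost change moving the same workload from 5.4 to 5.5,
    assuming 5.5 emits output_efficiency times as many output tokens."""
    old = cost(GPT54, tokens_in, tokens_out)
    new = cost(GPT55, tokens_in, tokens_out * output_efficiency)
    return 100 * (new - old) / old

# Output-heavy agentic workload: the efficiency gain bites hardest here.
print(round(migration_delta(200_000, 1_000_000), 1))   # +22.6
# Input-heavy retrieval workload: much closer to the raw 2x.
print(round(migration_delta(1_000_000, 100_000), 1))   # +70.0
```

The spread between the two examples is the point: OpenAI's "+20% net" figure is plausible for output-heavy agentic traffic and badly optimistic for input-heavy retrieval traffic.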

GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro: how to choose

Short answer: Use GPT-5.5 for agentic workflows, computer-use tasks, and broad knowledge work where autonomy matters more than raw code correctness. Use Claude Opus 4.7 for production coding, code review, and workloads where hallucination discipline is critical. Use Gemini 3.1 Pro where you are already deep in the Google ecosystem or need its multimodal strengths. For most teams, the right answer is a multi-model stack — not a single winner.
|                     | GPT-5.5                                          | Claude Opus 4.7                                      | Gemini 3.1 Pro                                             |
|---------------------|--------------------------------------------------|------------------------------------------------------|------------------------------------------------------------|
| Positioning         | Agentic work & computer use                      | Hard coding & long runs                              | Multimodal & Google stack                                  |
| Input / 1M tokens   | $5.00                                            | $5.00                                                | ~$1.25–$5.00                                               |
| Output / 1M tokens  | $30.00                                           | $25.00                                               | ~$10–$15                                                   |
| Context window      | 1M tokens                                        | 1M tokens                                            | 1M+ tokens                                                 |
| Terminal-Bench 2.0  | 82.7%                                            | 69.4%                                                | 68.5%                                                      |
| SWE-Bench Pro       | 58.6%                                            | 64.3%                                                | n/a                                                        |
| Hallucination rate* | 86%                                              | 36%                                                  | 50%                                                        |
| Best for            | Agents, computer use, research, finance modeling | Production coding, code review, regulated industries | Multimodal tasks, Google Cloud integration, cost-sensitive |

*AA-Omniscience hallucination rate when the model gives an incorrect answer. Lower is better.

The practical shape of the decision: if your workload has a clear “done” state and multiple tool calls in between (research with citations, spreadsheet builds, data pipelines, computer-use agents), GPT-5.5 is the strongest option on the market today. If your workload is concentrated in code that ships to production (pull requests, bug fixes, refactors where one wrong line costs hours), Opus 4.7’s SWE-Bench Pro lead and much lower hallucination rate matter more than GPT-5.5’s agentic gains. And if you’re running a lot of volume, neither Opus 4.7 nor GPT-5.5 is the right default — that’s still Sonnet 4.6’s or Gemini’s ground.
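In code, a multi-model stack usually starts as nothing fancier than a routing table keyed on workload type. A sketch of that decision, following the comparison above; the model ID strings are hypothetical placeholders, not confirmed API names:

```python
# Route each task class to the model that leads on it, per the comparison
# table above. Model IDs below are illustrative placeholders only.
ROUTES = {
    "agentic":     "gpt-5.5",           # terminal/computer-use, multi-step tasks
    "code_review": "claude-opus-4.7",   # SWE-Bench Pro lead, low hallucination rate
    "high_volume": "claude-sonnet-4.6", # cost-sensitive bulk traffic
    "multimodal":  "gemini-3.1-pro",    # Google-stack and multimodal work
}

def pick_model(task_type: str) -> str:
    # Default unknown traffic to the cheap model, not the frontier one:
    # misrouted bulk traffic at frontier prices is the expensive failure mode.
    return ROUTES.get(task_type, ROUTES["high_volume"])

print(pick_model("agentic"))   # gpt-5.5
print(pick_model("unknown"))   # claude-sonnet-4.6
```

The interesting design choice is the default: fail cheap, then promote task types into the frontier buckets once your eval data justifies it.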

The era of picking one frontier model for everything is over. GPT-5.5 for agents, Opus 4.7 for code review, Sonnet 4.6 for volume — the question is no longer which model is best, but which portfolio is right for your workload. — The AI & Tech Society Editorial View

The hallucination trade-off nobody should ignore

Short answer: GPT-5.5 achieves the highest accuracy ever recorded on AA-Omniscience (57%) but carries an 86% hallucination rate — meaning when it answers incorrectly, it answers confidently. Claude Opus 4.7 sits at 36%, Gemini 3.1 Pro Preview at 50%. For regulated deployments (legal, healthcare, compliance, finance advice), this gap is a material factor — and it points directly at where verification and retrieval pipelines need to get stronger.
Watch this

GPT-5.5 is simultaneously more accurate on knowledge-heavy tasks and more confidently wrong when it fails. The combination is a specific failure mode: a model that is right often enough that users stop checking, and wrong confidently enough that the errors are hard to catch. For any deployment where accuracy matters — legal, finance, medical, compliance — this changes the verification stack, not just the model choice.

OpenAI has not hidden this. The model is classified as “High” on cybersecurity capabilities under its preparedness framework — below “Critical” but above every previous public release. Roughly 200 early-access partner organizations tested the model for approximately eight weeks before general availability, and the company describes this as its strongest safety deployment to date. These are reasonable moves, but they don’t eliminate the calibration problem. Teams building on GPT-5.5 should assume verification layers (retrieval-augmented answers, citation-required outputs, second-model review) are not optional for regulated use.
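A verification layer can start very small: refuse to surface any answer that lacks inline citations, and route flagged answers to retrieval-grounded regeneration or human review. A minimal sketch; the citation pattern (bracketed numbers or URLs) and the threshold are illustrative assumptions, not a standard:

```python
import re

# Assumed citation markers: "[1]"-style references or bare URLs.
CITATION = re.compile(r"\[\d+\]|\bhttps?://\S+")

def gate(answer: str, min_citations: int = 1) -> dict:
    """Pass an answer only if it carries enough inline citations;
    otherwise flag it for regeneration or human review."""
    found = CITATION.findall(answer)
    return {"ok": len(found) >= min_citations, "citations": found}

print(gate("Rates rose in Q1 [1], reversing the prior trend [2]."))
print(gate("Rates definitely rose in Q1.")["ok"])   # False -> escalate
```

A gate like this does not catch a confidently wrong answer that cites a real source incorrectly; it only forces every claim into a form a second model or a human can check.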

Implications for developers

Short answer: Start with agentic workflows — this is where GPT-5.5 differentiates, not where it merely matches. Use the effort dial aggressively: medium effort often matches competitor models at a fraction of their cost. Instrument token usage before and after migration, because the 40% efficiency claim is workload-dependent. And build verification into any production path, because the hallucination profile has changed shape.

Three concrete recommendations for engineering teams. First, rewrite agent prompts, don’t port them. GPT-5.5’s agentic improvements come from the base model being trained for long-horizon tool use. Prompts written for GPT-5.4 that wrap the model in heavy scaffolding, step-by-step chains of thought, and defensive retries are likely over-engineered for 5.5 — and the extra tokens will show up on the bill. Start minimal and add scaffolding only where the benchmarks on your workload demand it.

Second, use the effort dial as the primary cost knob. Five effort levels (non-reasoning through xhigh) means a single model can cover the range from cheap-triage to hard-reasoning without switching model IDs. Medium is the surprising sweet spot — reported to match Claude Opus 4.7 performance on many workloads at significantly lower cost. Default to medium, promote to high/xhigh for the tasks that fail at medium, and reserve xhigh for code review and research.
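Operationally, that dial becomes a retry ladder: run at medium, promote on failure. A sketch under the effort names described above; `run_fn` and `check_fn` stand in for your real model call and whatever task-success check you already run:

```python
EFFORT_LADDER = ["medium", "high", "xhigh"]  # default first, promote on failure

def run_with_escalation(task, run_fn, check_fn):
    """Try the task at each effort level in turn, stopping at the first
    output that passes the caller's success check."""
    result = None
    for effort in EFFORT_LADDER:
        result = run_fn(task, effort=effort)
        if check_fn(result):
            return effort, result
    return "failed", result

# Stub run/check functions standing in for a real model call plus eval:
run = lambda task, effort: f"{task}@{effort}"
check = lambda r: r.endswith("@high")            # pretend medium fails
print(run_with_escalation("migrate-db", run, check))   # ('high', 'migrate-db@high')
```

The cost structure falls out naturally: most traffic never leaves medium, and only the tasks that demonstrably fail there ever pay xhigh prices.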

Third, instrument from day one. The pricing jump plus the token efficiency offset plus workload variance means any post-hoc cost modeling will be wrong. Log input tokens, output tokens, and task success rates in parallel with GPT-5.4 for at least two weeks on a representative slice of traffic before committing. This matters especially for any team that already chose Opus 4.7 recently — the comparison is now live, and you have real production data to decide on.
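The metric worth logging is cost per successful task, not cost per token. A sketch of the comparison record, using the list prices from this release; the field names and sample numbers are illustrative:

```python
from dataclasses import dataclass

# Published (input, output) rates per 1M tokens for the models under comparison.
PRICES = {"gpt-5.4": (2.50, 15.00), "gpt-5.5": (5.00, 30.00)}

@dataclass
class RunLog:
    model: str
    tokens_in: int
    tokens_out: int
    succeeded: bool

def cost_per_success(logs):
    """Total spend divided by completed tasks: the number that decides the
    migration, since 5.5's higher rate may buy a higher completion rate."""
    spend, wins = 0.0, 0
    for r in logs:
        pin, pout = PRICES[r.model]
        spend += (r.tokens_in * pin + r.tokens_out * pout) / 1_000_000
        wins += r.succeeded
    return spend / wins if wins else float("inf")

logs = [RunLog("gpt-5.5", 50_000, 12_000, True),
        RunLog("gpt-5.5", 40_000, 9_000, False)]
print(round(cost_per_success(logs), 2))   # 1.08
```

Run the same accounting for the GPT-5.4 arm of the parallel test and compare the two ratios, not the two token bills.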

Implications for CTOs and tech leaders

Short answer: Three questions to answer this quarter. First, is your AI stack a portfolio or a single-vendor dependency? GPT-5.5 plus Opus 4.7 plus Sonnet (or Gemini) together beat any one of them alone. Second, how does your verification layer change when hallucination rates go up even as accuracy goes up? Third, is your organization structured to absorb a 2× price jump in exchange for agentic capability gains, or is cost still the primary constraint?

The portfolio question is the most important strategic shift of the quarter. For the first time, three generally available frontier models each have distinct, defensible strengths: GPT-5.5 on agents and computer use, Opus 4.7 on production coding and low hallucination rates, Gemini 3.1 Pro on multimodal and Google-native workflows. The companies shipping the best AI-powered products in Q3 2026 will almost certainly be the ones running all three in different parts of their stack — not the ones that picked a winner.

The verification question is harder and more urgent. GPT-5.5’s accuracy/hallucination profile is a product design issue, not just an eng issue. If your product answers user questions with AI-generated content, your verification UI, your citation discipline, your confidence-scoring, and your fallback behavior all need a review before you move production traffic. OpenAI’s classification of GPT-5.5 as “High” on the preparedness framework is a real signal — not one that should block adoption, but one that should shape how the model is exposed to end users.

Finally, the pricing question is about organizational posture. Teams that optimize primarily for per-token cost will struggle with GPT-5.5; the model is priced as a frontier product, not a commodity. Teams that optimize for task completion cost (tokens × tasks × rework) may find the math improves, especially on agentic workloads where GPT-5.5’s completion rates are meaningfully higher. Which framing your organization uses is largely a function of where AI sits in the budget structure — the models didn’t change, the strategic question did.

Is GPT-5.5 worth it? (Final take)

Short answer: Yes for teams building agents, computer-use products, research tools, or knowledge-work automation — this is where GPT-5.5’s architecture advantage translates directly into product capability. Cautious yes for production coding, where Opus 4.7 still wins on SWE-Bench Pro and hallucination discipline. Measured skepticism on pricing — the list rate doubled, and the efficiency offset is workload-dependent. Run a two-week parallel eval before migrating production traffic.

GPT-5.5 is the most confident release OpenAI has shipped this year, and also the most contested. It is clearly state of the art on the benchmarks it was built for — agentic workflows, computer use, long-horizon knowledge work — and it is clearly not state of the art on some benchmarks it wasn’t, most notably production coding and hallucination discipline. The honest reading is that the frontier has fragmented. There is no single “best” model now; there are a handful of models that each lead on specific, meaningful workloads.

The simplest summary I can offer: GPT-5.5 is the model for teams whose product is “an AI does the work.” Claude Opus 4.7 is the model for teams whose product is “an AI helps my engineers ship code.” Sonnet 4.6 and Gemini 3.1 Pro are still the right defaults for everything else. If your team has an agentic product on the roadmap for 2026 — a research assistant, a computer-use agent, a multi-step automation — GPT-5.5 is the release to evaluate this month, not next quarter. If your team is primarily shipping code, the upgrade is less urgent. Either way, the age of single-model AI stacks is functionally over.

Frequently asked questions

Short answer: Quick answers to the most common questions about GPT-5.5 — availability, pricing, Pro variant, benchmarks, and how it compares to the competition.
When was GPT-5.5 released?
GPT-5.5 became available on April 23, 2026, rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. API access is listed as “coming soon” pending additional safety and scaling requirements. GPT-5.5 Pro is available in ChatGPT only for Pro, Business, and Enterprise users.
How much will GPT-5.5 cost on the API?
$5 per million input tokens and $30 per million output tokens — double the per-token price of GPT-5.4 ($2.50 / $15). OpenAI claims roughly 40% fewer output tokens on comparable Codex tasks, reducing the net cost increase to approximately +20%. Real cost impact varies heavily by workload type.
Is GPT-5.5 better than Claude Opus 4.7?
On agentic and computer-use benchmarks, yes — GPT-5.5 leads Terminal-Bench 2.0 by roughly 13 points (82.7% vs 69.4%) and tops OSWorld-Verified at 78.7%. On production software engineering (SWE-Bench Pro) and hallucination discipline, Claude Opus 4.7 still leads. The right answer is workload-specific, not absolute.
What makes GPT-5.5 different from GPT-5.4?
GPT-5.5 is the first fully retrained base model since GPT-4.5 — the architecture, pretraining corpus, and agentic objectives were all reworked, not just fine-tuned. GPT-5.1, 5.2, 5.3, and 5.4 were all post-training iterations on the same base. This is why OpenAI frames GPT-5.5 as a “new class of intelligence” rather than another point release.
Can GPT-5.5 operate a computer autonomously?
Yes — and this is where it sets the state of the art. On OSWorld-Verified, which measures whether a model can operate real computer environments, GPT-5.5 scores 78.7%. Combined with tool use (98.0% on Tau2-bench Telecom without prompt tuning) and long-horizon task completion, it is positioned as the leading model for computer-use agents.
Does GPT-5.5 support a 1 million token context window?
Yes. GPT-5.5 is OpenAI’s first API model to ship with a 1 million token context window, matching Claude Opus 4.7 and Sonnet 4.6. This is a meaningful upgrade from GPT-5.4’s 400K context and enables full-codebase or full-document workflows without chunking.
What is GPT-5.5 Pro and who should use it?
GPT-5.5 Pro is a higher-accuracy variant available in ChatGPT for Pro, Business, and Enterprise users. It leads on BrowseComp (90.1%) and FrontierMath Tier 1-3 (52.4%). Use it for deep web research, advanced mathematics, scientific literature synthesis, and complex legal or financial analysis where a few percentage points of accuracy justify substantially more compute.
