GPT-5 Programming 2025: Benchmarks, Pricing & AI Tools
Description: GPT-5 finally ships with real gains in coding. Here’s what the benchmarks actually mean, what it costs, the catches, and five tips to get better results.
If you missed the livestream (and the… “creative” charts), here’s the short of it: GPT-5 is out, it’s fast, and—crucially—it’s better at real code than earlier models. There’s legit signal in the numbers, plus a few launch-week faceplants worth knowing about. (Business Insider)
What the numbers actually say (and why they matter)
- Real-repo bug fixing (SWE-bench Verified): 74.9%. This benchmark gives a model a real GitHub repo + issue and counts it as a win only if its patch passes the tests. GPT-5 sets the new top score here. Translation: it’s not just autocomplete; it can land fixes that compile and pass. (OpenAI, swebench.com)
- Code-editing (Aider Polyglot): 88%. This one measures whether the model can reliably produce correct diffs/whole-file edits across tough Exercism problems in multiple languages. High scores here usually mean fewer malformed patches and less babysitting in your editor. (OpenAI, aider.chat)
- Efficiency vs. o3 on tough tasks: OpenAI reports GPT-5 hits those results with ~22% fewer output tokens and ~45% fewer tool calls (at comparable high-reasoning settings). That matters for both latency and your cloud bill. (OpenAI)
Nerd note: SWE-bench Verified is a 500-task human-filtered slice of real issues; OpenAI says 23 tasks were omitted in their run due to infra quirks—so don’t compare raw percentages without reading footnotes. (swebench.com, OpenAI)
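In code, the Verified win condition boils down to: apply the patch, run the repo’s own tests, count a win only on green. Here’s a minimal sketch of that scoring logic (the real harness is containerized and far more involved; the function and its callables are illustrative, not SWE-bench’s actual code):

```python
from typing import Callable

def swe_bench_win(apply_patch: Callable[[], bool],
                  run_tests: Callable[[], bool]) -> bool:
    """A task is solved only if the model's patch applies cleanly
    AND the repo's own test suite passes afterward. No partial credit."""
    if not apply_patch():
        return False    # malformed or rejected patch -> automatic loss
    return run_tests()  # green suite -> win; anything else -> loss
```

So a patch that compiles but breaks a single test scores exactly the same as no patch at all—which is why 74.9% is stricter than it sounds.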
How this shows up in your editor
OpenAI says GPT-5 is now the default model in ChatGPT for signed-in users, with a beefier “GPT-5 Thinking” toggle for harder prompts. There’s also a real-time router that decides when to think longer vs. answer quickly—useful in theory, but you can also force deep-thinking by literally asking it to “think hard about this.” (OpenAI)
For developers, the API variant is tuned for agentic coding (tool use, multi-step plans) and was tested with popular coding tools (Cursor, Windsurf, Copilot, etc.). This is where that “fewer tool calls, better results” claim should shine. (OpenAI)
Pricing in plain English
- GPT-5 API: $1.25 per 1M input tokens; $10 per 1M output tokens; cached input $0.125 per 1M.
- GPT-5 mini / nano exist if you need cheaper & faster for simpler jobs. (And yes, GPT-5 is rolling out across ChatGPT—free and paid—with tiered limits.) (OpenAI)
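To make those rates concrete, here’s a back-of-envelope cost calculator using the published per-million-token prices (the helper function is mine, not an OpenAI SDK call):

```python
# Published GPT-5 rates, USD per 1M tokens. "Cached input" applies to
# repeated prompt prefixes (system prompts, shared repo context, etc.).
RATES = {"input": 1.25, "cached_input": 0.125, "output": 10.00}

def request_cost(input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0) -> float:
    """Estimate the USD cost of one API request."""
    fresh_input = input_tokens - cached_input_tokens
    usd = (fresh_input * RATES["input"]
           + cached_input_tokens * RATES["cached_input"]
           + output_tokens * RATES["output"]) / 1_000_000
    return round(usd, 6)

# A 20k-token prompt (half of it cached) with a 2k-token answer:
cost = request_cost(20_000, 2_000, cached_input_tokens=10_000)
# -> 0.03375, i.e. about 3.4 cents
```

Note how output tokens dominate: at 8× the input rate, a long “thinky” answer costs more than a much larger prompt.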
Reality check: it’s not all roses 🌹
- The launch demo had some… chart crimes (OpenAI acknowledged and fixed them). If you felt gaslit by bar lengths, you weren’t alone. (Business Insider)
- In the wild, routing can misfire and GPT-5 will still bungle basics now and then (spelling, geography—yikes). It’s powerful, not perfect. (The Guardian)
Five quick tips to actually get better code out of GPT-5
- Say the magic words: add “think hard about this” (or pick GPT-5 Thinking) for cross-file refactors, gnarly bugs, or green-field scaffolding. (OpenAI)
- Feed it the right context: repo map, key files, failing tests, and error traces. These benchmarks reward realistic setup for a reason. (swebench.com)
- Ask for diffs + self-checks: request `git diff`/patch format and a post-patch test run plan. That mirrors how it wins on Aider & SWE-bench. (OpenAI, aider.chat)
- Let it use tools, but audit results: the model’s better at chaining steps with fewer calls—still review the plan like you would a junior dev’s. (OpenAI)
- Mind your tokens: long “thinky” answers are pricey; use them where they pay off (architecture, debugging), not for renaming variables. (OpenAI)
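Tips 1–3 compose naturally into a single prompt. Here’s a hypothetical helper that packs the context from tip #2, the “think hard” nudge from tip #1, and the diff-plus-test-plan request from tip #3 (the function name and section layout are my own convention, not anything GPT-5 requires):

```python
def build_debug_prompt(issue: str, repo_map: str,
                       failing_tests: str, trace: str) -> str:
    """Assemble issue description, repo map, failing tests, and error
    trace into one prompt, asking for a diff and a post-patch test plan."""
    return "\n\n".join([
        "Think hard about this.",                    # tip 1: force deep reasoning
        f"## Issue\n{issue}",
        f"## Repo map\n{repo_map}",                  # tip 2: realistic context
        f"## Failing tests\n{failing_tests}",
        f"## Error trace\n{trace}",
        "Reply with a unified `git diff` patch and "  # tip 3: diff + self-check
        "a post-patch plan for re-running the tests.",
    ])
```

Whatever tool you send this through, the shape is the point: the benchmarks GPT-5 tops were run with exactly this kind of grounded, test-anchored setup.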
So… is GPT-5 “good at programming”?
Short answer: Yes—best-in-class on real-world code benchmarks right now, with practical gains in stability and tool use. It won’t replace tests, code review, or your taste in API design—but it will ship more fixes on the first try and waste fewer cycles doing it. And with ChatGPT nearing 700M weekly users, expect your teammates (and competitors) to be piloting it soon. (OpenAI)
Sources and further reading
- OpenAI: Introducing GPT-5 (benchmarks, routing, availability). (OpenAI)
- OpenAI: GPT-5 for Developers (SWE-bench details, efficiency vs. o3). (OpenAI)
- OpenAI API Pricing (official token rates). (OpenAI)
- SWE-bench (what “Verified” means). (swebench.com)
- Aider Polyglot (what that code-editing test measures). (aider.chat)