OpenAI
Comparing 27 models from OpenAI.
| Model | Total | Released | Context | CoT | Tooling | Creative Writing | Language | Utility | Reasoning | Text Editing | Rule Following | Hallucination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 (Reasoning) | 93.85% | Mar 5, 26 | 1m | ✓ | 100.00% | 91.17% | 94.90% | 96.89% | 94.89% | 98.42% | 79.29% | 95.22% |
| GPT-5.5 (Reasoning) | 93.72% | Apr 24, 26 | 1m | ✓ | 100.00% | 90.26% | 99.69% | 96.60% | 92.87% | 98.79% | 79.40% | 92.12% |
| GPT-5.5 (Reasoning, Low) | 92.51% | Apr 24, 26 | 1m | ✓ | 98.50% | 90.24% | 99.24% | 96.36% | 88.87% | 98.59% | 76.90% | 91.37% |
| GPT-5 | 91.48% | Aug 7, 25 | 400k | ✓ | 98.67% | 86.87% | 91.50% | 93.53% | 89.34% | 98.90% | 77.13% | 95.92% |
| GPT-5 Mini | 91.31% | Aug 7, 25 | 400k | ✓ | 99.99% | 80.48% | 96.49% | 98.39% | 83.31% | 97.13% | 76.44% | 98.28% |
| GPT-5.4 (Reasoning, Low) | 90.91% | Mar 5, 26 | 1m | ✓ | 99.97% | 90.51% | 90.79% | 95.32% | 86.99% | 98.01% | 70.02% | 95.71% |
| GPT-5.1 | 90.73% | Nov 13, 25 | 400k | ✓ | 98.33% | 87.20% | 93.64% | 95.33% | 82.57% | 98.54% | 74.05% | 96.22% |
| GPT-5.4 Mini (Reasoning) | 89.82% | Mar 17, 26 | 400k | ✓ | 100.00% | 88.66% | 98.12% | 94.44% | 86.10% | 95.78% | 57.38% | 98.11% |
| GPT-5.2 | 89.45% | Dec 10, 25 | 400k | – | 100.00% | 80.36% | 91.19% | 96.22% | 85.61% | 97.54% | 67.10% | 97.55% |
| GPT-5.5 | 89.37% | Apr 24, 26 | 1m | – | 100.00% | 90.39% | 94.15% | 81.88% | 87.68% | 98.20% | 72.34% | 90.28% |
| o4 Mini High | 88.78% | Apr 16, 25 | 200k | ✓ | 100.00% | 82.72% | 79.76% | 98.67% | 82.51% | 94.36% | 72.70% | 99.53% |
| GPT-4.1 | 86.82% | Apr 14, 25 | 1m | – | 97.71% | 81.24% | 93.91% | 90.57% | 74.40% | 94.40% | 66.78% | 95.60% |
| o4 Mini | 86.56% | Apr 16, 25 | 200k | ✓ | 100.00% | 82.04% | 80.00% | 96.31% | 80.39% | 90.61% | 64.61% | 98.54% |
| GPT-OSS 120B | 84.81% | Aug 5, 25 | 131k | ✓ | 100.00% | 67.85% | 97.18% | 92.03% | 77.00% | 91.73% | 55.03% | 97.63% |
| GPT-5.4 | 84.31% | Mar 5, 26 | 1m | – | 98.04% | 90.94% | 81.49% | 81.95% | 80.92% | 96.73% | 58.11% | 86.32% |
| GPT-5.4 Mini (Reasoning, Low) | 83.57% | Mar 17, 26 | 400k | ✓ | 100.00% | 87.72% | 92.45% | 88.49% | 74.50% | 92.63% | 33.99% | 98.78% |
| GPT-4o, Aug. 6th (temp=0) | 82.18% | Aug 6, 24 | 128k | – | 99.96% | 73.65% | 75.00% | 82.11% | 75.40% | 93.77% | 74.19% | 83.33% |
| GPT-4.1 Mini | 81.40% | Apr 14, 25 | 1m | – | 96.94% | 74.52% | 89.64% | 82.30% | 67.35% | 95.62% | 58.59% | 86.23% |
| GPT-4o, Aug. 6th (temp=1) | 81.28% | Aug 6, 24 | 128k | – | 98.86% | 75.50% | 82.21% | 82.44% | 70.58% | 86.72% | 67.91% | 86.00% |
| GPT-5.4 Mini | 80.45% | Mar 17, 26 | 400k | – | 98.55% | 88.10% | 88.75% | 79.37% | 70.07% | 90.60% | 46.32% | 81.85% |
| GPT-5 Nano | 80.16% | Aug 7, 25 | 400k | ✓ | 99.34% | 67.04% | 77.18% | 93.91% | 70.50% | 82.74% | 57.57% | 92.99% |
| GPT-5.4 Nano (Reasoning) | 80.02% | Mar 17, 26 | 400k | ✓ | 96.43% | 80.97% | 83.99% | 93.34% | 77.96% | 83.32% | 27.15% | 96.98% |
| GPT-4o Mini (temp=1) | 77.82% | Jul 18, 24 | 128k | – | 98.71% | 74.37% | 77.50% | 82.16% | 62.26% | 85.78% | 56.50% | 85.28% |
| GPT-5.4 Nano (Reasoning, Low) | 77.46% | Mar 17, 26 | 400k | ✓ | 91.46% | 80.93% | 81.87% | 91.42% | 64.43% | 82.23% | 31.65% | 95.71% |
| GPT-4o Mini (temp=0) | 76.86% | Jul 18, 24 | 128k | – | 97.45% | 73.10% | 75.00% | 81.43% | 62.17% | 84.62% | 58.84% | 82.23% |
| GPT-5.4 Nano | 72.16% | Mar 17, 26 | 400k | – | 91.85% | 80.50% | 80.82% | 78.57% | 58.53% | 79.22% | 20.94% | 86.86% |
| GPT-4.1 Nano | 69.90% | Apr 14, 25 | 1m | – | 78.47% | 71.81% | 78.95% | 68.45% | 54.55% | 76.06% | 40.88% | 90.07% |
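The page does not say how Total is computed. For the rows checked here, though, it matches the unweighted mean of the eight category scores to within rounding. The sketch below reproduces that check in Python. The function name `unweighted_total` and the 0.02 tolerance are illustrative choices, and the equal-weight formula is an inference from the numbers, not something the source documents.

```python
from statistics import mean

def unweighted_total(category_scores):
    """Return the plain (equal-weight) mean of the category scores."""
    return mean(category_scores)

# (published Total, category scores) copied from the table, in column order:
# Tooling, Creative Writing, Language, Utility, Reasoning,
# Text Editing, Rule Following, Hallucination.
rows = {
    "GPT-5.4 (Reasoning)": (93.85, [100.00, 91.17, 94.90, 96.89, 94.89, 98.42, 79.29, 95.22]),
    "GPT-5": (91.48, [98.67, 86.87, 91.50, 93.53, 89.34, 98.90, 77.13, 95.92]),
    "GPT-4.1 Nano": (69.90, [78.47, 71.81, 78.95, 68.45, 54.55, 76.06, 40.88, 90.07]),
}

# Assumed tolerance: the category scores are themselves rounded to two
# decimals, so the recomputed mean can differ slightly from the published one.
TOLERANCE = 0.02

for name, (published, scores) in rows.items():
    recomputed = unweighted_total(scores)
    status = "matches" if abs(recomputed - published) < TOLERANCE else "differs"
    print(f"{name}: published {published:.2f}, recomputed {recomputed:.4f} ({status})")
```

If the benchmark ever weighted categories unequally, this check would start reporting "differs", so it doubles as a quick sanity test when copying rows out of the table.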
Model Performance
Cost vs Performance
Compares total benchmark cost against overall score for OpenAI models. Quadrant lines are drawn at the median cost and the median score.
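The median-based quadrant split can be sketched in a few lines of Python. The scores below come from the table, but the cost figures are made-up placeholders, because the page's cost data is not included in the text. The function name `assign_quadrants` and the rule that a point exactly on a median line counts as "low cost" or "high score" are also assumptions for illustration.

```python
from statistics import median

def assign_quadrants(points):
    """Label each model by its side of the median cost and median score.

    points maps model name -> (cost, score).
    """
    cost_mid = median(cost for cost, _ in points.values())
    score_mid = median(score for _, score in points.values())
    labels = {}
    for name, (cost, score) in points.items():
        # Tie-breaking on the median lines is an arbitrary choice here.
        cost_side = "low cost" if cost <= cost_mid else "high cost"
        score_side = "high score" if score >= score_mid else "low score"
        labels[name] = f"{cost_side}, {score_side}"
    return labels

# Scores are the Total column from the table; costs are hypothetical
# placeholder values in arbitrary units, NOT real benchmark costs.
example = {
    "GPT-5.4 (Reasoning)": (40.0, 93.85),
    "GPT-5 Mini": (5.0, 91.31),
    "GPT-4.1": (12.0, 86.82),
    "GPT-4.1 Nano": (1.0, 69.90),
}

for model, quadrant in assign_quadrants(example).items():
    print(f"{model}: {quadrant}")
```

Using medians instead of fixed thresholds means the quadrants always split the plotted models roughly in half on each axis, so a model's quadrant is relative to the other models shown and not an absolute rating.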
Cost Breakdown
Total benchmark cost per model, broken down by input, reasoning, and output tokens.