Comparing 21 models from Google, sorted by total score (highest first).

| Model | Total | Released | Context | CoT | Tooling | Creative Writing | Language | Utility | Reasoning | Text Editing | Rule Following | Hallucination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 94.08% | Feb 19, 26 | 1m | ✓ | 99.92% | 85.44% | 94.90% | 99.91% | 88.20% | 98.51% | 91.21% | 94.53% |
| Gemini 3.5 Flash (Reasoning) | 93.35% | May 19, 26 | 1m | ✓ | 100.00% | 79.87% | 94.41% | 98.86% | 88.22% | 97.78% | 92.04% | 95.61% |
| Gemini 3 Flash (Preview, Reasoning) | 89.93% | Dec 17, 25 | 1m | ✓ | 100.00% | 75.87% | 94.93% | 97.20% | 86.16% | 98.12% | 74.48% | 92.65% |
| Gemma 4 31B (Reasoning) | 89.64% | Apr 3, 26 | 256k | ✓ | 98.50% | 78.13% | 83.82% | 96.32% | 80.14% | 98.83% | 85.00% | 96.37% |
| Gemma 4 26B (Reasoning) | 89.02% | Apr 3, 26 | 256k | ✓ | 99.04% | 76.38% | 95.20% | 95.69% | 74.05% | 98.26% | 74.75% | 98.79% |
| Gemini 3 Pro (Preview) | 88.79% | Nov 18, 25 | 1m | ✓ | 99.98% | 77.77% | 89.64% | 96.14% | 95.24% | 98.86% | 64.47% | 88.23% |
| Gemini 2.5 Pro | 88.44% | Jun 17, 25 | 1m | ✓ | 100.00% | 81.03% | 92.57% | 92.18% | 89.21% | 98.58% | 60.89% | 93.05% |
| Gemini 3.1 Flash Lite (Reasoning) | 85.91% | May 7, 26 | 1m | ✓ | 99.81% | 76.31% | 96.70% | 92.32% | 75.32% | 96.90% | 62.26% | 87.63% |
| Gemini 3.5 Flash (Reasoning, Minimal) | 85.88% | May 19, 26 | 1m | – | 100.00% | 78.55% | 97.50% | 83.90% | 78.93% | 98.26% | 60.38% | 89.55% |
| Gemini 3 Flash (Preview) | 85.47% | Dec 17, 25 | 1m | – | 98.04% | 75.04% | 95.00% | 86.39% | 80.99% | 97.54% | 65.14% | 85.61% |
| Gemini 3.1 Flash Lite (Preview) | 85.41% | Feb 19, 26 | 1m | – | 99.98% | 75.78% | 94.98% | 94.00% | 75.75% | 96.46% | 59.04% | 87.29% |
| Gemma 4 31B | 85.23% | Apr 3, 26 | 256k | – | 100.00% | 75.59% | 75.00% | 86.69% | 78.18% | 98.56% | 72.72% | 95.09% |
| Gemini 3.1 Flash Lite | 85.09% | May 7, 26 | 1m | – | 99.97% | 76.01% | 90.90% | 92.77% | 74.62% | 97.35% | 61.21% | 87.86% |
| Gemma 4 26B | 84.89% | Apr 3, 26 | 256k | – | 91.94% | 75.17% | 92.02% | 83.17% | 73.83% | 97.04% | 71.75% | 94.22% |
| Gemini 2.5 Flash (Reasoning) | 84.14% | Jun 17, 25 | 1m | ✓ | 100.00% | 76.30% | 86.06% | 82.25% | 72.63% | 98.12% | 59.97% | 97.79% |
| Gemini 2.5 Flash Lite (Reasoning) | 83.10% | Jul 22, 25 | 1m | ✓ | 98.28% | 71.64% | 74.36% | 89.63% | 72.51% | 94.54% | 66.81% | 96.99% |
| Gemini 2.5 Flash | 80.61% | Jun 17, 25 | 1m | – | 99.97% | 77.57% | 86.23% | 61.45% | 79.09% | 97.83% | 57.47% | 85.27% |
| Gemini 2.5 Flash Lite | 79.91% | Jul 22, 25 | 1m | – | 96.00% | 75.05% | 82.75% | 80.14% | 65.19% | 92.13% | 59.96% | 88.08% |
| Gemma 3 12B | 76.07% | Mar 12, 25 | 128k | – | 94.16% | 75.38% | 80.10% | 79.28% | 56.28% | 85.18% | 61.05% | 77.13% |
| Gemma 3 27B | 75.70% | Mar 12, 25 | 128k | – | 96.15% | 78.79% | 77.21% | 76.82% | 61.51% | 86.63% | 47.98% | 80.53% |
| Gemma 3 4B | 66.33% | Mar 12, 25 | 128k | – | 92.40% | 72.10% | 72.28% | 60.30% | 52.86% | 78.38% | 26.37% | 75.94% |
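
The page does not say how the Total column is computed. For the three rows checked below, it agrees with the unweighted mean of the eight category scores to within rounding. Treating that as the scoring rule is an assumption drawn from the numbers, not a documented formula. The helper names `unweighted_mean` and `matches_total` are illustrative only.

```python
# Category scores (percent) copied from three rows of the table, in column
# order: Tooling, Creative Writing, Language, Utility, Reasoning,
# Text Editing, Rule Following, Hallucination.
# Each entry is (listed Total, [category scores]).
rows = {
    "Gemini 3.1 Pro (Preview)": (
        94.08, [99.92, 85.44, 94.90, 99.91, 88.20, 98.51, 91.21, 94.53]),
    "Gemini 2.5 Flash": (
        80.61, [99.97, 77.57, 86.23, 61.45, 79.09, 97.83, 57.47, 85.27]),
    "Gemma 3 4B": (
        66.33, [92.40, 72.10, 72.28, 60.30, 52.86, 78.38, 26.37, 75.94]),
}


def unweighted_mean(scores):
    """Plain arithmetic mean; every category gets the same weight (assumed)."""
    return sum(scores) / len(scores)


def matches_total(total, scores, tolerance=0.01):
    """True if the listed total is within `tolerance` points of the mean.

    The tolerance allows for the table values being rounded to two decimals.
    """
    return abs(unweighted_mean(scores) - total) <= tolerance


for name, (total, scores) in rows.items():
    status = "consistent" if matches_total(total, scores) else "differs"
    print(f"{name}: listed total {total}%, {status} with the category mean")
```

If the assumption holds for every row, a weak category such as Rule Following moves the Total by one eighth of its own change, which is why models with near-perfect Tooling scores can still rank low.
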

Model Performance
Cost vs Performance
Plots total benchmark cost against overall score for each Google model. Quadrant lines are drawn at the median cost and the median score.
One low-scoring outlier is hidden from the chart: Gemma 3 4B (66.3%).
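
A minimal sketch of how median quadrant lines split a cost/score scatter. The per-model costs are not listed in this section, so the example uses made-up placeholder models and values. The function name `assign_quadrants`, the tie handling, and the quadrant labels are all assumptions, not the page's actual implementation.

```python
from statistics import median


def assign_quadrants(points):
    """Label each model by its side of the median cost and median score.

    points: dict mapping model name -> (cost, score).
    Ties go to the "low cost" and "high score" sides (an arbitrary choice).
    """
    cost_mid = median(cost for cost, _ in points.values())
    score_mid = median(score for _, score in points.values())
    labels = {}
    for name, (cost, score) in points.items():
        cost_side = "low cost" if cost <= cost_mid else "high cost"
        score_side = "high score" if score >= score_mid else "low score"
        labels[name] = f"{cost_side}, {score_side}"
    return labels


# Hypothetical placeholder data: (cost in USD, score in percent).
# These are NOT values from the leaderboard.
example = {
    "model_a": (1.0, 90.0),
    "model_b": (4.0, 92.0),
    "model_c": (2.0, 80.0),
    "model_d": (8.0, 70.0),
    "model_e": (3.0, 85.0),
}

labels = assign_quadrants(example)
for name, label in labels.items():
    print(f"{name}: {label}")
```

Whether the hidden outlier is included when the medians are computed is not stated. Dropping a point before calling `assign_quadrants` can shift both lines, so the choice matters for models near the boundary.
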
Cost Breakdown
Total benchmark cost per model, broken down by input, reasoning, and output tokens.