Comparison of 13 Google models, sorted by total score in descending order.
| Model | Total | Released | Context | CoT | Tooling | Creative Writing | Language | Utility | Reasoning | Text Editing | Rule Following | Hallucination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 94.37% | Feb 19, 26 | 1m | ✓ | 99.90% | 85.44% | 94.90% | 99.91% | 96.01% | 98.51% | 91.21% | 89.06% |
| Gemini 3 Flash (Preview, Reasoning) | 90.50% | Dec 17, 25 | 1m | ✓ | 100.00% | 75.87% | 94.93% | 97.20% | 98.05% | 98.12% | 74.48% | 85.32% |
| Gemini 3 Pro (Preview) | 88.79% | Nov 18, 25 | 1m | ✓ | 99.98% | 77.77% | 89.64% | 96.14% | 95.24% | 98.86% | 64.47% | 88.23% |
| Gemini 2.5 Pro | 88.53% | Jun 17, 25 | 1m | ✓ | 100.00% | 81.03% | 92.57% | 92.18% | 96.91% | 98.58% | 60.89% | 86.11% |
| Gemini 2.5 Flash (Reasoning) | 86.51% | Jun 17, 25 | 1m | ✓ | 100.00% | 76.30% | 86.06% | 82.25% | 93.81% | 98.12% | 59.97% | 95.60% |
| Gemini 3.1 Flash Lite (Preview) | 85.87% | Feb 19, 26 | 1m | – | 99.98% | 75.78% | 94.98% | 94.00% | 92.15% | 96.46% | 59.04% | 74.58% |
| Gemini 2.5 Flash Lite (Reasoning) | 85.75% | Jul 22, 25 | 1m | ✓ | 99.54% | 71.64% | 74.36% | 89.63% | 93.86% | 94.54% | 66.81% | 95.59% |
| Gemini 3 Flash (Preview) | 85.35% | Dec 17, 25 | 1m | – | 97.64% | 75.04% | 95.00% | 86.39% | 94.79% | 97.54% | 65.14% | 71.24% |
| Gemini 2.5 Flash Lite | 81.08% | Jul 22, 25 | 1m | – | 96.60% | 75.05% | 82.75% | 80.14% | 85.80% | 92.13% | 59.96% | 76.17% |
| Gemini 2.5 Flash | 80.60% | Jun 17, 25 | 1m | – | 99.96% | 77.57% | 86.23% | 61.45% | 92.60% | 97.83% | 57.47% | 71.70% |
| Gemma 3 12B | 78.41% | Mar 12, 25 | 128k | – | 97.69% | 75.38% | 80.10% | 79.28% | 79.42% | 85.18% | 61.05% | 69.15% |
| Gemma 3 27B | 77.85% | Mar 12, 25 | 128k | – | 99.88% | 78.79% | 77.21% | 76.82% | 86.74% | 86.63% | 47.98% | 68.74% |
| Gemma 3 4B | 68.57% | Mar 12, 25 | 128k | – | 97.88% | 72.10% | 72.28% | 60.30% | 73.64% | 78.38% | 26.37% | 67.60% |
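
The Total column appears to be the unweighted mean of the eight category scores (Tooling through Hallucination). The page does not state this, so treat it as an observation from the numbers rather than a documented formula. A minimal sketch that checks it for three rows copied from the table; the names `rows`, `mean_score`, and `matches_total` are illustrative only:

```python
# Listed Total and the eight category scores (percent) for three rows,
# copied from the table: Tooling, Creative Writing, Language, Utility,
# Reasoning, Text Editing, Rule Following, Hallucination.
rows = {
    "Gemini 3.1 Pro (Preview)": (94.37, [99.90, 85.44, 94.90, 99.91, 96.01, 98.51, 91.21, 89.06]),
    "Gemini 2.5 Flash": (80.60, [99.96, 77.57, 86.23, 61.45, 92.60, 97.83, 57.47, 71.70]),
    "Gemma 3 4B": (68.57, [97.88, 72.10, 72.28, 60.30, 73.64, 78.38, 26.37, 67.60]),
}


def mean_score(scores):
    """Unweighted arithmetic mean of the category scores."""
    return sum(scores) / len(scores)


def matches_total(name, tolerance=0.01):
    """True if the mean of the categories agrees with the listed Total.

    The tolerance absorbs the two-decimal rounding used in the table.
    ASSUMPTION: Total is an unweighted mean; the page does not confirm this.
    """
    listed_total, scores = rows[name]
    return abs(mean_score(scores) - listed_total) < tolerance


for name in rows:
    print(f"{name}: mean={mean_score(rows[name][1]):.4f}, matches={matches_total(name)}")
```

If the check fails for some other row, the Total may be weighted or may include categories not shown here.
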
Model Performance
Cost vs Performance
Plots each Google model's total benchmark cost against its overall score. Quadrant lines are drawn at the median cost and the median score.
One low-scoring outlier is hidden from the chart: Gemma 3 4B (68.57%).
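
The median-based quadrant split described above can be sketched as follows. This is an illustration under stated assumptions, not the site's implementation: the scores come from the table, but the cost values are hypothetical placeholders because no cost figures appear on this page, and it is not known whether the hidden outlier is included when the medians are computed. The function name `quadrants` is also illustrative.

```python
from statistics import median


def quadrants(models):
    """Assign each (name, cost, score) tuple to a quadrant.

    The dividing lines are the median cost and the median score of the
    models passed in. Ties are counted as "low cost" and "high score".
    """
    cost_cut = median(cost for _, cost, _ in models)
    score_cut = median(score for _, _, score in models)
    labels = {}
    for name, cost, score in models:
        cost_side = "low cost" if cost <= cost_cut else "high cost"
        score_side = "high score" if score >= score_cut else "low score"
        labels[name] = f"{cost_side}, {score_side}"
    return labels


# Scores are taken from the table. The cost column is HYPOTHETICAL and
# exists only to make the example runnable.
sample = [
    ("Gemini 3.1 Pro (Preview)", 10.0, 94.37),
    ("Gemini 2.5 Pro", 6.0, 88.53),
    ("Gemini 2.5 Flash", 1.0, 80.60),
    ("Gemma 3 12B", 0.5, 78.41),
]

for name, label in quadrants(sample).items():
    print(f"{name}: {label}")
```

Using medians rather than means keeps the dividing lines from shifting much when one model is far cheaper or far weaker than the rest.
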
Cost Breakdown
Total benchmark cost per model, broken down by input, reasoning, and output tokens.