Google

Comparing 19 models from Google.

| Model | Total | Released | Context | Tooling | Creative Writing | Language | Utility | Reasoning | Text Editing | Rule Following | Hallucination |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 94.37% | Feb 19, 2026 | 1m | 99.90% | 85.44% | 94.90% | 99.91% | 96.01% | 98.51% | 91.21% | 89.06% |
| Gemma 4 31B (Reasoning) | 91.71% | Apr 3, 2026 | 256k | 100.00% | 78.13% | 83.82% | 96.32% | 97.19% | 98.83% | 85.00% | 94.41% |
| Gemma 4 26B (Reasoning) | 91.49% | Apr 3, 2026 | 256k | 98.85% | 76.38% | 95.20% | 95.69% | 95.21% | 98.26% | 74.75% | 97.59% |
| Gemini 3 Flash (Preview, Reasoning) | 90.50% | Dec 17, 2025 | 1m | 100.00% | 75.87% | 94.93% | 97.20% | 98.05% | 98.12% | 74.48% | 85.32% |
| Gemini 3 Pro (Preview) | 88.79% | Nov 18, 2025 | 1m | 99.98% | 77.77% | 89.64% | 96.14% | 95.24% | 98.86% | 64.47% | 88.23% |
| Gemini 2.5 Pro | 88.53% | Jun 17, 2025 | 1m | 100.00% | 81.03% | 92.57% | 92.18% | 96.91% | 98.58% | 60.89% | 86.11% |
| Gemma 4 31B | 86.91% | Apr 3, 2026 | 256k | 100.00% | 75.59% | 75.00% | 86.69% | 96.52% | 98.56% | 72.72% | 90.17% |
| Gemini 2.5 Flash (Reasoning) | 86.51% | Jun 17, 2025 | 1m | 100.00% | 76.30% | 86.06% | 82.25% | 93.81% | 98.12% | 59.97% | 95.60% |
| Gemini 3.1 Flash Lite (Reasoning) | 86.41% | May 7, 2026 | 1m | 99.77% | 76.31% | 96.70% | 92.32% | 91.72% | 96.90% | 62.26% | 75.26% |
| Gemini 3.1 Flash Lite (Preview) | 85.87% | Feb 19, 2026 | 1m | 99.98% | 75.78% | 94.98% | 94.00% | 92.15% | 96.46% | 59.04% | 74.58% |
| Gemma 4 26B | 85.84% | Apr 3, 2026 | 256k | 91.13% | 75.17% | 92.02% | 83.17% | 88.02% | 97.04% | 71.75% | 88.45% |
| Gemini 3.1 Flash Lite | 85.75% | May 7, 2026 | 1m | 99.97% | 76.01% | 90.90% | 92.77% | 92.10% | 97.35% | 61.21% | 75.72% |
| Gemini 2.5 Flash Lite (Reasoning) | 85.75% | Jul 22, 2025 | 1m | 99.54% | 71.64% | 74.36% | 89.63% | 93.86% | 94.54% | 66.81% | 95.59% |
| Gemini 3 Flash (Preview) | 85.35% | Dec 17, 2025 | 1m | 97.64% | 75.04% | 95.00% | 86.39% | 94.79% | 97.54% | 65.14% | 71.24% |
| Gemini 2.5 Flash Lite | 81.08% | Jul 22, 2025 | 1m | 96.60% | 75.05% | 82.75% | 80.14% | 85.80% | 92.13% | 59.96% | 76.17% |
| Gemini 2.5 Flash | 80.60% | Jun 17, 2025 | 1m | 99.96% | 77.57% | 86.23% | 61.45% | 92.60% | 97.83% | 57.47% | 71.70% |
| Gemma 3 12B | 78.41% | Mar 12, 2025 | 128k | 97.69% | 75.38% | 80.10% | 79.28% | 79.42% | 85.18% | 61.05% | 69.15% |
| Gemma 3 27B | 77.85% | Mar 12, 2025 | 128k | 99.88% | 78.79% | 77.21% | 76.82% | 86.74% | 86.63% | 47.98% | 68.74% |
| Gemma 3 4B | 68.57% | Mar 12, 2025 | 128k | 97.88% | 72.10% | 72.28% | 60.30% | 73.64% | 78.38% | 26.37% | 67.60% |

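The page does not say how the Total column is derived. In the rows checked here it is consistent with an unweighted mean of the eight category scores. The sketch below checks that reading for the top and bottom rows. It assumes the eight score columns are Tooling, Creative Writing, Language, Utility, Reasoning, Text Editing, Rule Following, and Hallucination; the function names are illustrative and not part of the source.

```python
# Category scores copied from two rows of the leaderboard, assumed to be in the
# order: Tooling, Creative Writing, Language, Utility, Reasoning,
# Text Editing, Rule Following, Hallucination.
ROWS = {
    "Gemini 3.1 Pro (Preview)": (
        94.37,
        [99.90, 85.44, 94.90, 99.91, 96.01, 98.51, 91.21, 89.06],
    ),
    "Gemma 3 4B": (
        68.57,
        [97.88, 72.10, 72.28, 60.30, 73.64, 78.38, 26.37, 67.60],
    ),
}


def mean_score(scores):
    """Unweighted arithmetic mean of the category scores."""
    return sum(scores) / len(scores)


def matches_total(total, scores, tolerance=0.01):
    """True if the mean agrees with the reported total within rounding."""
    return abs(mean_score(scores) - total) <= tolerance


if __name__ == "__main__":
    for model, (total, scores) in ROWS.items():
        status = "consistent" if matches_total(total, scores) else "differs"
        print(f"{model}: mean={mean_score(scores):.2f}, reported={total:.2f} ({status})")
```

If the mean reading holds, a single weak category such as Rule Following pulls the total down by one eighth of its shortfall, which would explain why models with near-identical Tooling scores still spread widely in the ranking.
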
Model Performance
Cost vs Performance

Compares total benchmark cost against overall score for Google models. Quadrant lines are drawn at the median values.
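A minimal sketch of how median-based quadrant lines could be computed and used to label each point. The cost figures are not included in this text, so the sample points are placeholders and not real benchmark data. The function names and the tie-handling rule (a value equal to the median counts as the better side) are assumptions.

```python
from statistics import median


def quadrant_lines(points):
    """Return (median_cost, median_score) for a list of (cost, score) pairs."""
    costs = [cost for cost, _ in points]
    scores = [score for _, score in points]
    return median(costs), median(scores)


def classify(cost, score, median_cost, median_score):
    """Label a point by the side of each median line it falls on.

    Assumption: values exactly on a line count as low cost / high score.
    """
    cost_side = "low cost" if cost <= median_cost else "high cost"
    score_side = "high score" if score >= median_score else "low score"
    return f"{cost_side}, {score_side}"


# Placeholder (cost in USD, score in %) pairs, for illustration only.
SAMPLE_POINTS = {
    "model_a": (0.50, 90.0),
    "model_b": (2.00, 95.0),
    "model_c": (1.00, 80.0),
    "model_d": (4.00, 85.0),
    "model_e": (3.00, 70.0),
}

if __name__ == "__main__":
    med_cost, med_score = quadrant_lines(list(SAMPLE_POINTS.values()))
    for name, (cost, score) in SAMPLE_POINTS.items():
        print(name, "->", classify(cost, score, med_cost, med_score))
```

Medians keep the quadrant lines stable when one model is far cheaper or far weaker than the rest, which a mean would not.
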

One low-scoring outlier is hidden from the chart: Gemma 3 4B (68.6%).

Cost Breakdown

Total benchmark cost per model, broken down by input, reasoning, and output tokens.
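A rough sketch of how a per-model cost could be split into the three parts named above. The token counts and prices are placeholders, not values from this page. Billing reasoning tokens at the output rate is an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    input_tokens: int
    reasoning_tokens: int
    output_tokens: int


@dataclass
class Pricing:
    usd_per_million_input: float
    usd_per_million_output: float


def cost_breakdown(usage, pricing):
    """Split total cost (USD) into input, reasoning, and output parts.

    Assumption: reasoning tokens are billed at the output-token rate.
    """
    per_million = 1_000_000
    parts = {
        "input": usage.input_tokens / per_million * pricing.usd_per_million_input,
        "reasoning": usage.reasoning_tokens / per_million * pricing.usd_per_million_output,
        "output": usage.output_tokens / per_million * pricing.usd_per_million_output,
    }
    parts["total"] = parts["input"] + parts["reasoning"] + parts["output"]
    return parts


if __name__ == "__main__":
    # Placeholder numbers, for illustration only.
    usage = TokenUsage(input_tokens=1_000_000, reasoning_tokens=500_000, output_tokens=250_000)
    pricing = Pricing(usd_per_million_input=1.0, usd_per_million_output=4.0)
    for part, usd in cost_breakdown(usage, pricing).items():
        print(f"{part}: ${usd:.2f}")
```

Keeping reasoning tokens as a separate part shows how much of a reasoning model's cost comes from its intermediate thinking and how much from the final answer.
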