Models

The following models are used for testing. Some models appear multiple times with different parameters to measure their impact on the results.

| Model | Total | Released | Context | CoT | Tooling | Creative Writing | Language Utility | Reasoning | Text Editing | Rule Following | Hallucination |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 95.02% | Feb 4, 26 | 1m | 100.00% | 84.55% | 96.12% | 98.93% | 93.77% | 98.86% | 89.78% | 98.13% |
| Gemini 3.1 Pro (Preview) | 94.37% | Feb 19, 26 | 1m | 99.90% | 85.44% | 94.90% | 99.91% | 96.01% | 98.51% | 91.21% | 89.06% |
| Claude Sonnet 4.6 (Reasoning) | 93.66% | Feb 17, 26 | 1m | 100.00% | 83.09% | 97.58% | 97.88% | 92.76% | 98.30% | 85.73% | 93.96% |
| GPT-5.4 (Reasoning) | 93.24% | Mar 5, 26 | 1m | 100.00% | 91.17% | 94.90% | 96.89% | 94.78% | 98.42% | 79.29% | 90.43% |
| GPT-5 Mini | 92.62% | Aug 7, 25 | 400k | 99.99% | 80.48% | 96.49% | 98.39% | 94.36% | 97.13% | 76.44% | 97.71% |
| GPT-5.1 | 92.54% | Nov 13, 25 | 400k | 98.00% | 87.20% | 93.64% | 95.33% | 95.14% | 98.54% | 74.05% | 98.44% |
| Claude Opus 4.6 | 92.35% | Feb 4, 26 | 1m | 100.00% | 83.59% | 96.13% | 90.72% | 93.33% | 98.35% | 83.11% | 93.60% |
| GPT-5 | 91.93% | Aug 7, 25 | 400k | 100.00% | 86.87% | 91.50% | 93.53% | 95.67% | 98.90% | 77.13% | 91.84% |
| Qwen 3.5 397B A17B | 91.73% | Feb 15, 26 | 128k | 99.77% | 86.93% | 95.01% | 97.50% | 95.06% | 98.05% | 79.39% | 82.10% |
| Qwen 3.5 122B | 91.53% | Feb 25, 26 | 262k | 100.00% | 83.02% | 95.01% | 96.36% | 94.93% | 96.31% | 80.00% | 86.58% |
| Grok 4.20 (Beta, Reasoning) | 91.49% | Mar 12, 26 | 2m | 100.00% | 84.50% | 99.08% | 95.41% | 82.64% | 98.69% | 75.31% | 96.28% |
| GPT-5.4 (Reasoning, Low) | 91.41% | Mar 5, 26 | 1m | 99.96% | 90.51% | 90.79% | 95.32% | 94.34% | 98.01% | 70.02% | 92.29% |
| Z.AI GLM 5 | 91.23% | Feb 11, 26 | 200k | 100.00% | 83.63% | 92.06% | 94.11% | 95.89% | 98.59% | 67.78% | 97.74% |
| Claude Sonnet 4.6 | 91.15% | Feb 17, 26 | 1m | 100.00% | 83.31% | 100.00% | 88.52% | 88.48% | 96.37% | 82.50% | 89.99% |
| MoonshotAI: Kimi K2.5 | 91.04% | Jan 27, 26 | 262k | 100.00% | 81.35% | 97.10% | 96.63% | 95.41% | 97.79% | 72.03% | 88.02% |
| Qwen 3.5 27B | 90.85% | Feb 25, 26 | 262k | 99.00% | 82.54% | 95.52% | 95.67% | 92.73% | 98.69% | 76.04% | 86.58% |
| ByteDance Seed 1.6 | 90.70% | Dec 23, 25 | 256k | 98.00% | 78.43% | 95.63% | 90.83% | 91.49% | 98.40% | 77.71% | 95.14% |
| Gemini 3 Flash (Preview, Reasoning) | 90.50% | Dec 17, 25 | 1m | 100.00% | 75.87% | 94.93% | 97.20% | 98.05% | 98.12% | 74.48% | 85.32% |
| o4 Mini High | 90.29% | Apr 16, 25 | 200k | 100.00% | 82.72% | 79.76% | 98.67% | 95.02% | 94.36% | 72.70% | 99.06% |
| GPT-5.2 | 90.26% | Dec 10, 25 | 400k | 100.00% | 80.36% | 91.19% | 96.22% | 94.54% | 97.54% | 67.10% | 95.09% |
| Claude Opus 4.5 | 89.69% | Nov 24, 25 | 200k | 100.00% | 81.71% | 99.66% | 89.84% | 93.93% | 97.69% | 72.61% | 82.06% |
| Grok 4.1 Fast | 89.55% | Nov 19, 25 | 2m | 100.00% | 82.14% | 88.76% | 84.12% | 93.58% | 97.87% | 70.87% | 99.02% |
| Aion 2.0 | 89.21% | Feb 23, 26 | 131k | 99.53% | 80.24% | 96.17% | 90.91% | 94.13% | 95.34% | 63.77% | 93.57% |
| Z.AI GLM 4.6 | 89.11% | Sep 30, 25 | 200k | 99.99% | 78.86% | 96.60% | 88.58% | 95.12% | 97.78% | 65.85% | 90.08% |
| Gemini 3 Pro (Preview) | 88.79% | Nov 18, 25 | 1m | 99.98% | 77.77% | 89.64% | 96.14% | 95.24% | 98.86% | 64.47% | 88.23% |
| Claude Sonnet 4 | 88.72% | May 22, 25 | 200k | 100.00% | 79.21% | 91.31% | 84.02% | 94.48% | 99.13% | 81.52% | 80.08% |
| Minimax M2.5 | 88.71% | Feb 12, 26 | 196k | 97.93% | 81.21% | 96.05% | 90.42% | 92.42% | 96.02% | 62.69% | 92.94% |
| Z.AI GLM 4.7 | 88.69% | Dec 22, 25 | 200k | 100.00% | 78.89% | 85.46% | 94.31% | 94.99% | 98.22% | 69.16% | 88.47% |
| GPT-4.1 | 88.68% | Apr 14, 25 | 1m | 98.86% | 81.24% | 93.91% | 90.57% | 88.46% | 94.40% | 66.78% | 95.24% |
| Gemini 2.5 Pro | 88.53% | Jun 17, 25 | 1m | 100.00% | 81.03% | 92.57% | 92.18% | 96.91% | 98.58% | 60.89% | 86.11% |
| o4 Mini | 88.35% | Apr 16, 25 | 200k | 100.00% | 82.04% | 80.00% | 96.31% | 94.45% | 90.61% | 64.61% | 98.75% |
| Grok 4 | 88.12% | Jul 9, 25 | 256k | 99.99% | 77.34% | 90.61% | 89.67% | 96.01% | 98.76% | 63.09% | 89.45% |
| Claude Sonnet 4.5 | 88.03% | Sep 29, 25 | 1m | 100.00% | 84.19% | 92.39% | 83.78% | 92.50% | 99.02% | 76.80% | 75.57% |
| Qwen 3.5 35B | 88.00% | Feb 25, 26 | 262k | 93.98% | 83.51% | 91.95% | 96.42% | 94.88% | 94.95% | 67.42% | 80.87% |
| Claude Opus 4 | 87.69% | May 22, 25 | 200k | 100.00% | 83.79% | 93.01% | 88.81% | 92.59% | 97.25% | 70.37% | 75.68% |
| Stealth: Hunter Alpha | 87.34% | Mar 11, 26 | 1m | 99.99% | 79.18% | 93.35% | 84.63% | 91.67% | 95.53% | 63.63% | 90.78% |
| ByteDance Seed 2.0 Mini | 86.91% | Feb 26, 26 | 262k | 99.64% | 80.11% | 90.12% | 91.88% | 92.40% | 91.08% | 58.77% | 91.33% |
| Gemini 2.5 Flash (Reasoning) | 86.51% | Jun 17, 25 | 1m | 100.00% | 76.30% | 86.06% | 82.25% | 93.81% | 98.12% | 59.97% | 95.60% |
| Qwen 3.5 Flash | 86.38% | Feb 25, 26 | 1m | 87.87% | 83.81% | 91.94% | 96.11% | 94.66% | 92.80% | 63.19% | 80.63% |
| Z.AI GLM 4.5 | 86.27% | Jul 25, 25 | 131k | 99.89% | 76.56% | 97.33% | 79.19% | 91.03% | 95.32% | 63.79% | 87.05% |
| Grok 4 Fast | 86.15% | Sep 19, 25 | 2m | 99.65% | 77.03% | 84.61% | 76.76% | 94.89% | 97.26% | 67.91% | 91.09% |
| Qwen 3.5 9B | 86.05% | Mar 10, 26 | 262k | 97.00% | 84.35% | 88.18% | 94.02% | 92.93% | 85.35% | 60.98% | 85.58% |
| Qwen 3.5 Plus (2026-02-15) | 85.96% | Feb 15, 26 | 1m | 99.74% | 77.07% | 95.10% | 86.65% | 93.45% | 98.10% | 64.21% | 73.35% |
| Stealth: Healer Alpha | 85.93% | Mar 11, 26 | 262k | 100.00% | 78.28% | 88.45% | 82.30% | 91.67% | 96.04% | 56.03% | 94.67% |
| Gemini 3.1 Flash Lite (Preview) | 85.87% | Feb 19, 26 | 1m | 99.98% | 75.78% | 94.98% | 94.00% | 92.15% | 96.46% | 59.04% | 74.58% |
| Gemini 2.5 Flash Lite (Reasoning) | 85.75% | Jul 22, 25 | 1m | 99.54% | 71.64% | 74.36% | 89.63% | 93.86% | 94.54% | 66.81% | 95.59% |
| Mistral Large 3 | 85.43% | Dec 1, 25 | 262k | 99.66% | 81.21% | 92.02% | 84.91% | 88.95% | 94.09% | 64.41% | 78.17% |
| GPT-4o, May 13th (temp=0) | 85.36% | May 13, 24 | 128k | 99.22% | 74.89% | 98.72% | 83.13% | 88.58% | 95.35% | 73.24% | 69.76% |
| Gemini 3 Flash (Preview) | 85.35% | Dec 17, 25 | 1m | 97.64% | 75.04% | 95.00% | 86.39% | 94.79% | 97.54% | 65.14% | 71.24% |
| Claude Haiku 4.5 | 85.14% | Oct 15, 25 | 200k | 99.10% | 78.96% | 91.84% | 72.48% | 87.76% | 96.81% | 70.35% | 83.86% |
| DeepSeek-V2 Chat | 84.83% | May 6, 24 | 128k | 99.76% | 77.20% | 100.00% | 83.82% | 88.70% | 90.90% | 68.78% | 69.48% |
| Z.AI GLM 4.7 Flash | 84.82% | Jan 19, 26 | 200k | 93.74% | 77.36% | 87.67% | 88.98% | 89.50% | 85.82% | 65.63% | 89.86% |
| ByteDance Seed 2.0 Lite | 84.80% | Mar 10, 26 | 262k | 99.91% | 82.35% | 96.80% | 92.23% | 95.50% | 95.03% | 36.85% | 79.75% |
| Nemotron 3 Super | 84.56% | Mar 11, 26 | 262k | 93.49% | 69.75% | 81.41% | 95.29% | 93.11% | 86.34% | 57.43% | 99.69% |
| GPT-5.4 | 84.32% | Mar 5, 26 | 1m | 97.65% | 90.94% | 81.49% | 81.95% | 93.92% | 96.73% | 58.11% | 73.78% |
| Claude 3.5 Sonnet | 84.24% | Jun 20, 24 | 200k | 100.00% | 78.69% | 85.62% | 76.75% | 90.30% | 96.57% | 69.67% | 76.31% |
| Grok 4.20 (Beta) | 83.85% | Mar 12, 26 | 2m | 100.00% | 82.80% | 91.17% | 82.15% | 87.05% | 95.49% | 53.89% | 78.28% |
| Inception Mercury 2 | 83.85% | Mar 4, 26 | 128k | 98.00% | 68.31% | 87.32% | 92.86% | 92.03% | 85.26% | 54.41% | 92.60% |
| GPT-4o, May 13th (temp=1) | 83.80% | May 13, 24 | 128k | 99.68% | 75.88% | 92.52% | 80.69% | 85.98% | 92.41% | 69.88% | 73.40% |
| Stealth: Aurora Alpha | 83.79% | Feb 9, 26 | 128k | 99.69% | 67.54% | 92.50% | 92.59% | 90.11% | — | 44.19% | 99.93% |
| Claude 3.5 Haiku | 83.73% | Oct 22, 24 | 200k | 99.69% | 75.28% | 82.12% | 82.57% | 82.23% | — | 64.18% | 100.00% |
| DeepSeek V3 (2024-12-26) | 83.68% | Dec 26, 24 | 163.8k | 100.00% | 77.88% | 87.88% | 81.87% | 88.71% | 93.58% | 66.39% | 73.11% |
| Claude 3.7 Sonnet | 83.39% | Feb 19, 25 | 200k | 99.32% | 76.31% | 92.95% | 62.54% | 89.94% | 97.12% | 73.78% | 75.18% |
| GPT-4.1 Mini | 83.20% | Apr 14, 25 | 1m | 97.92% | 74.52% | 89.64% | 82.30% | 85.83% | 95.62% | 58.59% | 81.14% |
| Hermes 3 405B | 82.86% | Aug 15, 24 | 128k | 99.78% | 80.92% | 99.57% | 69.02% | 85.58% | 89.14% | 59.17% | 79.70% |
| GPT-4o, Aug. 6th (temp=1) | 82.62% | Aug 6, 24 | 128k | 99.73% | 75.50% | 82.21% | 82.44% | 86.91% | 86.72% | 67.91% | 79.53% |
| GPT-5 Nano | 82.60% | Aug 7, 25 | 400k | 99.21% | 67.04% | 77.18% | 93.91% | 89.61% | 82.74% | 57.57% | 93.53% |
| GPT-4o, Aug. 6th (temp=0) | 82.45% | Aug 6, 24 | 128k | 99.95% | 73.65% | 75.00% | 82.11% | 87.59% | 93.77% | 74.19% | 73.35% |
| Mistral Large 2 | 82.41% | Jul 24, 24 | 128k | 99.78% | 81.86% | 85.22% | 69.19% | 88.20% | 94.16% | 63.05% | 77.87% |
| DeepSeek V3.1 | 82.39% | Aug 21, 25 | 163.8k | 97.96% | 77.45% | 96.87% | 76.65% | 83.95% | 87.27% | 66.15% | 72.80% |
| DeepSeek V3.2 | 82.25% | Dec 1, 25 | 163.8k | 99.99% | 79.95% | 85.01% | 81.58% | 89.46% | 95.78% | 53.75% | 72.50% |
| DeepSeek V3 (2025-03-24) | 81.99% | Mar 24, 25 | 163.8k | 93.53% | 82.34% | 86.42% | 80.62% | 88.45% | 89.57% | 67.94% | 67.07% |
| Gemini 2.5 Flash Lite | 81.08% | Jul 22, 25 | 1m | 96.60% | 75.05% | 82.75% | 80.14% | 85.80% | 92.13% | 59.96% | 76.17% |
| Gemini 2.5 Flash | 80.60% | Jun 17, 25 | 1m | 99.96% | 77.57% | 86.23% | 61.45% | 92.60% | 97.83% | 57.47% | 71.70% |
| Mistral Large | 80.15% | Feb 26, 24 | 32k | 98.67% | 82.02% | 88.64% | 73.04% | 76.31% | 95.14% | 49.87% | 77.50% |
| Writer: Palmyra X5 | 79.57% | Jan 26, 26 | 1m | 99.34% | 83.95% | 56.58% | 79.71% | 86.57% | 91.20% | 67.19% | 72.04% |
| Inception Mercury | 79.50% | Jun 25, 25 | 128k | 97.98% | 69.99% | 80.37% | 87.38% | 85.96% | 79.53% | 39.68% | 95.08% |
| GPT-4o Mini (temp=1) | 79.08% | Jul 18, 24 | 128k | 99.26% | 74.37% | 77.50% | 82.16% | 80.28% | 85.78% | 56.50% | 76.78% |
| Mistral Small 3.2 24B | 78.60% | Jun 20, 25 | 131k | 99.89% | 71.87% | 72.77% | 73.17% | 81.71% | 89.48% | 64.08% | 75.83% |
| Gemma 3 12B | 78.41% | Mar 12, 25 | 128k | 97.69% | 75.38% | 80.10% | 79.28% | 79.42% | 85.18% | 61.05% | 69.15% |
| Llama 3.1 70B | 78.40% | Jul 23, 24 | 128k | 88.59% | 72.78% | 80.18% | 81.03% | 79.31% | 92.10% | 63.45% | 69.78% |
| GPT-4o Mini (temp=0) | 78.29% | Jul 18, 24 | 128k | 98.54% | 73.10% | 75.00% | 81.43% | 81.26% | 84.62% | 58.84% | 73.56% |
| Gemma 3 27B | 77.85% | Mar 12, 25 | 128k | 99.88% | 78.79% | 77.21% | 76.82% | 86.74% | 86.63% | 47.98% | 68.74% |
| Mistral Medium 3.1 | 77.83% | Aug 13, 25 | 131k | 97.50% | 81.70% | 49.50% | 80.13% | 89.32% | 93.77% | 48.60% | 82.09% |
| Nemotron 3 Nano | 77.73% | Dec 14, 25 | 262k | 83.98% | 65.87% | 87.63% | 86.00% | 89.91% | 75.81% | 43.47% | 89.17% |
| Qwen 2.5 72B | 75.46% | Sep 19, 24 | 131.1k | 99.38% | 75.16% | 68.95% | 76.43% | 83.43% | 89.18% | 31.55% | 79.56% |
| Llama 3.1 Nemotron 70B | 74.70% | Oct 15, 24 | 128k | 95.74% | 71.71% | 46.80% | 88.31% | 82.19% | 87.26% | 50.62% | 74.99% |
| Arcee AI: Trinity Large (Preview) | 73.33% | Jan 27, 26 | 128k | 93.42% | 75.26% | 78.38% | 60.74% | 77.24% | 86.62% | 38.52% | 76.47% |
| ByteDance Seed 1.6 Flash | 73.27% | Dec 23, 25 | 256k | 39.46% | 81.51% | 61.23% | 84.16% | 86.52% | 91.64% | 47.15% | 94.52% |
| Mistral Small Creative | 73.27% | Dec 16, 25 | 32k | 86.85% | 80.29% | 41.85% | 76.28% | 87.99% | 90.31% | 48.15% | 74.46% |
| Hermes 3 70B | 72.57% | Aug 15, 24 | 128k | 97.86% | 77.41% | 81.66% | 61.15% | 79.08% | 63.34% | 53.00% | 67.05% |
| Ministral 3 14B | 72.54% | Dec 2, 25 | 262k | 91.91% | 79.11% | 30.00% | 79.03% | 83.24% | 86.20% | 50.83% | 79.99% |
| GPT-4.1 Nano | 71.94% | Apr 14, 25 | 1m | 81.37% | 71.81% | 78.95% | 68.45% | 70.24% | 76.06% | 40.88% | 87.73% |
| Ministral 3 8B | 71.76% | Dec 2, 25 | 262k | 99.42% | 77.26% | 48.96% | 74.43% | 71.64% | 78.52% | 31.34% | 92.52% |
| Claude 3 Haiku | 71.19% | Mar 13, 24 | 200k | 99.47% | 74.53% | 72.76% | 68.47% | 77.94% | 64.36% | 51.15% | 60.81% |
| WizardLM 2 8x22b | 71.07% | Apr 15, 24 | 65k | 90.27% | 79.06% | 78.05% | 67.14% | 67.36% | 88.13% | 28.27% | 70.24% |
| Arcee AI: Trinity Mini | 70.90% | Dec 1, 25 | 128k | 97.16% | 74.01% | 70.59% | 59.94% | 76.94% | 73.88% | 23.57% | 91.12% |
| Cohere Command R+ (Aug. 2024) | 69.03% | Aug 31, 24 | 128k | 91.84% | 77.70% | 66.58% | 59.51% | 65.10% | 68.40% | 58.70% | 64.40% |
| Gemma 3 4B | 68.57% | Mar 12, 25 | 128k | 97.88% | 72.10% | 72.28% | 60.30% | 73.64% | 78.38% | 26.37% | 67.60% |
| Ministral 3 3B | 67.22% | Dec 2, 25 | 131k | 93.79% | 75.45% | 68.10% | 72.38% | 71.88% | 69.80% | 15.87% | 70.45% |
| Mistral NeMO | 65.04% | Jul 18, 24 | 128k | 83.21% | 76.72% | 80.80% | 51.55% | 57.59% | 73.69% | 34.11% | 62.63% |
| Ministral 8B | 64.87% | Oct 16, 24 | 128k | 85.58% | 76.87% | 53.91% | 46.82% | 73.78% | 77.52% | 15.27% | 89.19% |
| Llama 3.1 8B | 63.37% | Jul 23, 24 | 128k | 69.49% | 76.54% | 64.06% | 74.82% | 69.12% | 75.45% | 34.03% | 43.41% |
| Ministral 3B | 61.29% | Oct 16, 24 | 128k | 87.64% | 75.49% | 42.25% | 49.17% | 69.70% | 70.91% | 24.45% | 70.75% |
| LFM2 24B | 58.77% | Feb 25, 26 | 32k | 16.85% | 78.10% | 64.64% | 69.48% | 54.88% | 71.56% | 24.12% | 90.53% |
| Rocinante 12B | 54.55% | Sep 30, 24 | 32k | 38.30% | 81.94% | 63.45% | 48.47% | 54.31% | 56.31% | 41.51% | 52.09% |

The source rows for Stealth: Aurora Alpha and Claude 3.5 Haiku list only seven category scores; the missing value is marked with a dash, as its column cannot be recovered.
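
The Total column appears to be the unweighted mean of the eight category scores; this holds for the rows checked (e.g. Claude Opus 4.6 (Reasoning) at 95.02% and Gemini 3.1 Pro at 94.37%). A minimal check in Python — the category values are copied from the table, but the averaging rule itself is an observation, not something documented here:

```python
# Category scores for Claude Opus 4.6 (Reasoning), copied from the table:
# CoT, Tooling, Creative Writing, Language Utility, Reasoning,
# Text Editing, Rule Following, Hallucination.
categories = [100.00, 84.55, 96.12, 98.93, 93.77, 98.86, 89.78, 98.13]

# Unweighted mean of the eight categories.
total = sum(categories) / len(categories)

# Agrees with the table's Total of 95.02% to within display rounding.
assert abs(total - 95.02) < 0.005
```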
Model Performance

Cost vs Performance

Compares total benchmark cost against overall score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.

3 low-scoring outliers hidden: Ministral 3B (61.3%), LFM2 24B (58.8%), Rocinante 12B (54.5%).
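
A sketch of how median-based quadrant lines could be derived. The (cost, score) points below are hypothetical placeholders — this section does not list per-model costs:

```python
from statistics import median

# Hypothetical (total cost in USD, overall score) pairs -- illustrative only.
points = {
    "Model A": (1.0, 91.0),
    "Model B": (3.0, 94.0),
    "Model C": (0.5, 85.0),
    "Model D": (8.0, 88.0),
}

# Quadrant lines are drawn at the median cost and median score.
cost_median = median(cost for cost, _ in points.values())    # 2.0
score_median = median(score for _, score in points.values())  # 89.5

def quadrant(cost: float, score: float) -> str:
    """Classify a model relative to the median-based quadrant lines."""
    side = "cheap" if cost <= cost_median else "expensive"
    level = "strong" if score >= score_median else "weak"
    return f"{side}/{level}"
```

With an even number of points, `statistics.median` averages the two middle values, which is why the quadrant lines here fall between models rather than on one.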

Cost Breakdown

Total benchmark cost per model, broken down by input, reasoning, and output tokens. Toggle between USD and token views. Only models with available cost data are shown.
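
The breakdown described above can be sketched as a sum over the three token classes. The token counts and per-million-token prices below are hypothetical, chosen only to illustrate the arithmetic:

```python
def benchmark_cost(tokens: dict, price_per_mtok: dict) -> float:
    """Total USD cost as the sum of input, reasoning, and output token costs."""
    return sum(tokens[k] / 1e6 * price_per_mtok[k] for k in tokens)

# Hypothetical run: 2M input, 0.5M reasoning, 0.3M output tokens,
# priced at $3/M for input and $15/M for reasoning and output.
tokens = {"input": 2_000_000, "reasoning": 500_000, "output": 300_000}
price = {"input": 3.0, "reasoning": 15.0, "output": 15.0}

total_usd = benchmark_cost(tokens, price)  # 6.0 + 7.5 + 4.5 = 18.0 USD
```

The USD/token toggle mentioned above would amount to plotting either `total_usd` or the raw entries of `tokens` per model.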