Models

The following models are used for testing. Some models appear multiple times with different parameters (for example, reasoning mode or temperature) to show how those settings affect the results.

Model | Total | Released | Context | CoT | Tooling | Creative Writing | Language | Utility | Reasoning | Text Editing | Rule Following | Hallucination
Claude Opus 4.6 (Reasoning) | 95.02% | Feb 4, 26 | 1m | ✓ | 100.00% | 84.55% | 96.12% | 98.93% | 93.77% | 98.86% | 89.78% | 98.13%
Qwen3.6 Max Preview | 94.54% | Apr 27, 26 | 262.1k | ✓ | 100.00% | 88.42% | 100.00% | 98.34% | 96.51% | 98.58% | 82.79% | 91.72%
Gemini 3.1 Pro (Preview) | 94.37% | Feb 19, 26 | 1m | ✓ | 99.90% | 85.44% | 94.90% | 99.91% | 96.01% | 98.51% | 91.21% | 89.06%
Z.AI GLM 5.1 | 94.37% | Apr 7, 26 | 200k | ✓ | 100.00% | 84.05% | 91.57% | 97.51% | 96.83% | 98.90% | 88.41% | 97.67%
Z.AI GLM 5 Turbo | 94.27% | Mar 15, 26 | 200k | ✓ | 93.95% | 84.66% | 99.90% | 96.36% | 95.67% | 98.17% | 86.78% | 98.64%
Claude Sonnet 4.6 (Reasoning) | 93.66% | Feb 17, 26 | 1m | ✓ | 100.00% | 83.09% | 97.58% | 97.88% | 92.76% | 98.30% | 85.73% | 93.96%
GPT-5.4 (Reasoning) | 93.24% | Mar 5, 26 | 1m | ✓ | 100.00% | 91.17% | 94.90% | 96.89% | 94.78% | 98.42% | 79.29% | 90.43%
Claude Opus 4.7 (Reasoning) | 93.23% | Apr 16, 26 | 1m | ✓ | 100.00% | 84.73% | 98.77% | 97.87% | 95.44% | 97.58% | 74.04% | 97.39%
GPT-5.5 (Reasoning) | 92.98% | Apr 24, 26 | 1m | ✓ | 100.00% | 90.26% | 99.69% | 96.60% | 94.89% | 98.79% | 79.40% | 84.24%
GPT-5 Mini | 92.62% | Aug 7, 25 | 400k | ✓ | 99.99% | 80.48% | 96.49% | 98.39% | 94.36% | 97.13% | 76.44% | 97.71%
GPT-5.5 (Reasoning, Low) | 92.59% | Apr 24, 26 | 1m | ✓ | 100.00% | 90.24% | 99.24% | 96.36% | 95.01% | 98.59% | 76.90% | 84.41%
GPT-5.1 | 92.54% | Nov 13, 25 | 400k | ✓ | 98.00% | 87.20% | 93.64% | 95.33% | 95.14% | 98.54% | 74.05% | 98.44%
Claude Opus 4.6 | 92.35% | Feb 4, 26 | 1m | – | 100.00% | 83.59% | 96.13% | 90.72% | 93.33% | 98.35% | 83.11% | 93.60%
MoonshotAI: Kimi K2.6 | 92.31% | Apr 20, 26 | 256k | ✓ | 99.77% | 85.47% | 96.77% | 97.42% | 95.22% | 98.00% | 76.94% | 88.85%
GPT-5 | 91.93% | Aug 7, 25 | 400k | ✓ | 100.00% | 86.87% | 91.50% | 93.53% | 95.67% | 98.90% | 77.13% | 91.84%
Qwen 3.5 397B A17B | 91.73% | Feb 15, 26 | 128k | ✓ | 99.77% | 86.93% | 95.01% | 97.50% | 95.06% | 98.05% | 79.39% | 82.10%
Qwen 3.5 122B | 91.53% | Feb 25, 26 | 262k | ✓ | 100.00% | 83.02% | 95.01% | 96.36% | 94.93% | 96.31% | 80.00% | 86.58%
Qwen 3.5 Plus (2026-04-20) | 91.51% | Apr 20, 26 | 1m | – | 96.00% | 85.18% | 97.14% | 96.42% | 94.99% | 97.70% | 67.53% | 97.13%
Grok 4.20 (Beta, Reasoning) | 91.49% | Mar 12, 26 | 2m | ✓ | 100.00% | 84.50% | 99.08% | 95.41% | 82.64% | 98.69% | 75.31% | 96.28%
GPT-5.4 (Reasoning, Low) | 91.41% | Mar 5, 26 | 1m | ✓ | 99.96% | 90.51% | 90.79% | 95.32% | 94.34% | 98.01% | 70.02% | 92.29%
Z.AI GLM 5 | 91.23% | Feb 11, 26 | 200k | ✓ | 100.00% | 83.63% | 92.06% | 94.11% | 95.89% | 98.59% | 67.78% | 97.74%
Claude Sonnet 4.6 | 91.15% | Feb 17, 26 | 1m | – | 100.00% | 83.31% | 100.00% | 88.52% | 88.48% | 96.37% | 82.50% | 89.98%
MoonshotAI: Kimi K2.5 | 91.04% | Jan 27, 26 | 262k | ✓ | 100.00% | 81.35% | 97.10% | 96.63% | 95.41% | 97.79% | 72.03% | 88.02%
Qwen 3.5 27B | 90.85% | Feb 25, 26 | 262k | ✓ | 99.00% | 82.54% | 95.52% | 95.67% | 92.73% | 98.69% | 76.04% | 86.58%
ByteDance Seed 1.6 | 90.70% | Dec 23, 25 | 256k | ✓ | 98.00% | 78.43% | 95.63% | 90.83% | 91.49% | 98.40% | 77.71% | 95.14%
Qwen 3.6 Flash | 90.65% | Apr 27, 26 | 1m | ✓ | 99.73% | 86.02% | 89.33% | 96.09% | 94.20% | 96.09% | 71.50% | 92.26%
GPT-5.4 Mini (Reasoning) | 90.65% | Mar 17, 26 | 400k | ✓ | 100.00% | 88.66% | 98.12% | 94.44% | 94.56% | 95.78% | 57.38% | 96.22%
Gemini 3 Flash (Preview, Reasoning) | 90.50% | Dec 17, 25 | 1m | ✓ | 100.00% | 75.87% | 94.93% | 97.20% | 98.05% | 98.12% | 74.48% | 85.32%
o4 Mini High | 90.29% | Apr 16, 25 | 200k | ✓ | 100.00% | 82.72% | 79.76% | 98.67% | 95.02% | 94.36% | 72.70% | 99.06%
GPT-5.2 | 90.26% | Dec 10, 25 | 400k | – | 100.00% | 80.36% | 91.19% | 96.22% | 94.54% | 97.54% | 67.10% | 95.09%
DeepSeek V4 Pro (Reasoning) | 90.10% | Apr 24, 26 | 1m | ✓ | 99.28% | 82.99% | 88.51% | 93.24% | 94.61% | 98.56% | 72.74% | 90.88%
Claude Opus 4.7 | 89.93% | Apr 16, 26 | 1m | – | 99.38% | 84.74% | 92.32% | 95.77% | 95.72% | 97.55% | 68.08% | 85.89%
Qwen 3.6 27B | 89.72% | Apr 27, 26 | 262.1k | ✓ | 97.49% | 82.81% | 89.01% | 94.31% | 93.08% | 93.97% | 71.37% | 95.73%
Claude Opus 4.5 | 89.69% | Nov 24, 25 | 200k | – | 100.00% | 81.71% | 99.66% | 89.84% | 93.93% | 97.69% | 72.61% | 82.06%
Grok 4.1 Fast | 89.55% | Nov 19, 25 | 2m | ✓ | 100.00% | 82.14% | 88.76% | 84.12% | 93.58% | 97.87% | 70.87% | 99.02%
Aion 2.0 | 89.21% | Feb 23, 26 | 131k | ✓ | 99.53% | 80.24% | 96.17% | 90.91% | 94.13% | 95.34% | 63.77% | 93.57%
Z.AI GLM 4.6 | 89.11% | Sep 30, 25 | 200k | ✓ | 99.99% | 78.86% | 96.60% | 88.58% | 95.12% | 97.78% | 65.85% | 90.08%
MiniMax M2.7 | 89.10% | Mar 18, 26 | 204.8k | ✓ | 99.98% | 81.70% | 84.80% | 95.50% | 93.28% | 92.14% | 68.90% | 96.47%
GPT-5.5 | 89.09% | Apr 24, 26 | 1m | – | 100.00% | 90.39% | 94.15% | 81.88% | 95.16% | 98.20% | 72.34% | 80.57%
Qwen 3.6 35B | 89.05% | Apr 27, 26 | 262.1k | ✓ | 80.00% | 85.97% | 93.56% | 96.20% | 94.08% | 95.10% | 77.34% | 90.19%
DeepSeek V4 Flash (Reasoning) | 89.01% | Apr 24, 26 | 1m | ✓ | 96.00% | 83.03% | 94.76% | 87.53% | 95.08% | 96.15% | 64.50% | 95.02%
Gemini 3 Pro (Preview) | 88.79% | Nov 18, 25 | 1m | ✓ | 99.98% | 77.77% | 89.64% | 96.14% | 95.24% | 98.86% | 64.47% | 88.23%
Claude Sonnet 4 | 88.72% | May 22, 25 | 200k | – | 100.00% | 79.21% | 91.31% | 84.02% | 94.48% | 99.13% | 81.52% | 80.08%
MiniMax M2.5 | 88.71% | Feb 12, 26 | 196k | ✓ | 97.93% | 81.21% | 96.05% | 90.42% | 92.42% | 96.02% | 62.69% | 92.94%
Z.AI GLM 4.7 | 88.69% | Dec 22, 25 | 200k | ✓ | 100.00% | 78.89% | 85.46% | 94.31% | 94.99% | 98.22% | 69.16% | 88.47%
GPT-4.1 | 88.68% | Apr 14, 25 | 1m | – | 98.86% | 81.24% | 93.91% | 90.57% | 88.46% | 94.40% | 66.78% | 95.24%
Gemini 2.5 Pro | 88.53% | Jun 17, 25 | 1m | ✓ | 100.00% | 81.03% | 92.57% | 92.18% | 96.91% | 98.58% | 60.89% | 86.11%
o4 Mini | 88.35% | Apr 16, 25 | 200k | ✓ | 100.00% | 82.04% | 80.00% | 96.31% | 94.45% | 90.61% | 64.61% | 98.75%
Grok 4 | 88.12% | Jul 9, 25 | 256k | ✓ | 99.99% | 77.34% | 90.61% | 89.67% | 96.01% | 98.76% | 63.09% | 89.45%
Claude Sonnet 4.5 | 88.03% | Sep 29, 25 | 1m | – | 100.00% | 84.19% | 92.39% | 83.78% | 92.50% | 99.02% | 76.80% | 75.57%
Qwen 3.5 35B | 88.00% | Feb 25, 26 | 262k | ✓ | 93.98% | 83.51% | 91.95% | 96.42% | 94.88% | 94.95% | 67.42% | 80.87%
Claude Opus 4 | 87.69% | May 22, 25 | 200k | – | 100.00% | 83.79% | 93.01% | 88.81% | 92.59% | 97.25% | 70.37% | 75.68%
Xiaomi MIMO v2.5 Pro | 87.36% | Apr 22, 26 | 1m | ✓ | 100.00% | 81.08% | 87.69% | 82.62% | 92.07% | 95.96% | 64.29% | 95.17%
Stealth: Hunter Alpha | 87.34% | Mar 11, 26 | 1m | – | 99.99% | 79.18% | 93.35% | 84.63% | 91.67% | 95.53% | 63.63% | 90.78%
ByteDance Seed 2.0 Mini | 86.91% | Feb 26, 26 | 262k | ✓ | 99.64% | 80.11% | 90.12% | 91.88% | 92.40% | 91.08% | 58.77% | 91.29%
Gemini 2.5 Flash (Reasoning) | 86.51% | Jun 17, 25 | 1m | ✓ | 100.00% | 76.30% | 86.06% | 82.25% | 93.81% | 98.12% | 59.97% | 95.60%
GPT-OSS 120B | 86.44% | Aug 5, 25 | 131k | ✓ | 100.00% | 67.85% | 97.18% | 92.03% | 92.42% | 91.73% | 55.03% | 95.29%
Qwen 3.5 Flash | 86.38% | Feb 25, 26 | 1m | ✓ | 87.87% | 83.81% | 91.94% | 96.11% | 94.66% | 92.80% | 63.19% | 80.63%
Z.AI GLM 4.5 | 86.27% | Jul 25, 25 | 131k | – | 99.89% | 76.56% | 97.33% | 79.19% | 91.03% | 95.32% | 63.79% | 87.05%
Grok 4 Fast | 86.15% | Sep 19, 25 | 2m | ✓ | 99.65% | 77.03% | 84.61% | 76.76% | 94.89% | 97.26% | 67.91% | 91.09%
Qwen 3.5 9B | 86.05% | Mar 10, 26 | 262k | ✓ | 97.00% | 84.35% | 88.18% | 94.02% | 92.93% | 85.35% | 60.98% | 85.58%
Qwen 3.5 Plus (2026-02-15) | 85.96% | Feb 15, 26 | 1m | – | 99.74% | 77.07% | 95.10% | 86.65% | 93.45% | 98.10% | 64.21% | 73.35%
Stealth: Healer Alpha | 85.93% | Mar 11, 26 | 262k | – | 100.00% | 78.28% | 88.45% | 82.30% | 91.67% | 96.04% | 56.03% | 94.67%
Gemini 3.1 Flash Lite (Preview) | 85.87% | Feb 19, 26 | 1m | – | 99.98% | 75.78% | 94.98% | 94.00% | 92.15% | 96.46% | 59.04% | 74.58%
GPT-5.4 Mini (Reasoning, Low) | 85.75% | Mar 17, 26 | 400k | ✓ | 100.00% | 87.72% | 92.45% | 88.49% | 92.28% | 92.63% | 33.99% | 98.43%
Gemini 2.5 Flash Lite (Reasoning) | 85.75% | Jul 22, 25 | 1m | ✓ | 99.54% | 71.64% | 74.36% | 89.63% | 93.86% | 94.54% | 66.81% | 95.59%
Mistral Large 3 | 85.43% | Dec 1, 25 | 262k | – | 99.66% | 81.21% | 92.02% | 84.91% | 88.95% | 94.09% | 64.41% | 78.17%
GPT-4o, May 13th (temp=0) | 85.36% | May 13, 24 | 128k | – | 99.22% | 74.89% | 98.72% | 83.13% | 88.58% | 95.35% | 73.24% | 69.76%
Gemini 3 Flash (Preview) | 85.35% | Dec 17, 25 | 1m | – | 97.64% | 75.04% | 95.00% | 86.39% | 94.79% | 97.54% | 65.14% | 71.24%
Claude Haiku 4.5 | 85.14% | Oct 15, 25 | 200k | – | 99.10% | 78.96% | 91.84% | 72.48% | 87.76% | 96.81% | 70.35% | 83.86%
Xiaomi MIMO v2.5 | 85.05% | Apr 22, 26 | 1m | ✓ | 99.96% | 79.16% | 76.33% | 81.15% | 92.43% | 95.34% | 60.75% | 95.25%
DeepSeek-V2 Chat | 84.83% | May 6, 24 | 128k | – | 99.76% | 77.20% | 100.00% | 83.82% | 88.70% | 90.90% | 68.78% | 69.48%
Z.AI GLM 4.7 Flash | 84.82% | Jan 19, 26 | 200k | ✓ | 93.74% | 77.36% | 87.67% | 88.98% | 89.50% | 85.82% | 65.63% | 89.86%
ByteDance Seed 2.0 Lite | 84.80% | Mar 10, 26 | 262k | ✓ | 99.91% | 82.35% | 96.80% | 92.23% | 95.50% | 95.03% | 36.85% | 79.75%
Nemotron 3 Super | 84.56% | Mar 11, 26 | 262k | – | 93.49% | 69.75% | 81.41% | 95.29% | 93.11% | 86.34% | 57.43% | 99.69%
GPT-5.4 | 84.32% | Mar 5, 26 | 1m | – | 97.65% | 90.94% | 81.49% | 81.95% | 93.92% | 96.73% | 58.11% | 73.78%
Claude 3.5 Sonnet | 84.24% | Jun 20, 24 | 200k | – | 100.00% | 78.69% | 85.62% | 76.75% | 90.30% | 96.57% | 69.67% | 76.31%
Grok 4.20 (Beta) | 83.85% | Mar 12, 26 | 2m | – | 100.00% | 82.80% | 91.17% | 82.15% | 87.05% | 95.49% | 53.89% | 78.28%
Inception Mercury 2 | 83.85% | Mar 4, 26 | 128k | – | 98.00% | 68.31% | 87.32% | 92.86% | 92.03% | 85.26% | 54.41% | 92.60%
GPT-4o, May 13th (temp=1) | 83.80% | May 13, 24 | 128k | – | 99.68% | 75.88% | 92.52% | 80.69% | 85.98% | 92.41% | 69.88% | 73.40%
Stealth: Aurora Alpha | 83.79% | Feb 9, 26 | 128k | – | 99.69% | 67.54% | 92.50% | 92.59% | 90.11% | – | 44.19% | 99.93%
DeepSeek V3 (2024-12-26) | 83.68% | Dec 26, 24 | 163.8k | – | 100.00% | 77.88% | 87.88% | 81.87% | 88.71% | 93.58% | 66.39% | 73.11%
Claude 3.7 Sonnet | 83.39% | Feb 19, 25 | 200k | – | 99.32% | 76.31% | 92.95% | 62.54% | 89.94% | 97.12% | 73.78% | 75.18%
GPT-4.1 Mini | 83.20% | Apr 14, 25 | 1m | – | 97.92% | 74.52% | 89.64% | 82.30% | 85.83% | 95.62% | 58.59% | 81.14%
Z.AI GLM 4.5 Air | 83.12% | Jul 25, 25 | 131k | – | 99.60% | 74.61% | 95.05% | 76.57% | 87.91% | 94.38% | 44.11% | 92.74%
Hermes 3 405B | 82.86% | Aug 15, 24 | 128k | – | 99.78% | 80.92% | 99.57% | 69.02% | 85.58% | 89.14% | 59.17% | 79.70%
DeepSeek V4 Pro | 82.63% | Apr 24, 26 | 1m | – | 99.49% | 83.70% | 72.80% | 77.57% | 87.07% | 97.98% | 63.74% | 78.72%
GPT-4o, Aug. 6th (temp=1) | 82.62% | Aug 6, 24 | 128k | – | 99.73% | 75.50% | 82.21% | 82.44% | 86.91% | 86.72% | 67.91% | 79.53%
GPT-5 Nano | 82.60% | Aug 7, 25 | 400k | ✓ | 99.21% | 67.04% | 77.18% | 93.91% | 89.61% | 82.74% | 57.57% | 93.52%
GPT-4o, Aug. 6th (temp=0) | 82.45% | Aug 6, 24 | 128k | – | 99.95% | 73.65% | 75.00% | 82.11% | 87.59% | 93.77% | 74.19% | 73.35%
GPT-5.4 Mini | 82.43% | Mar 17, 26 | 400k | – | 99.86% | 88.10% | 88.75% | 79.37% | 88.04% | 90.60% | 46.32% | 78.40%
Mistral Large 2 | 82.41% | Jul 24, 24 | 128k | – | 99.78% | 81.86% | 85.22% | 69.19% | 88.20% | 94.16% | 63.05% | 77.87%
Mistral Small 4 (Reasoning) | 82.39% | Mar 16, 26 | 265k | ✓ | 99.73% | 81.67% | 60.53% | 85.61% | 87.78% | 90.58% | 60.28% | 92.98%
DeepSeek V3.1 | 82.39% | Aug 21, 25 | 163.8k | – | 97.96% | 77.45% | 96.87% | 76.65% | 83.95% | 87.27% | 66.15% | 72.80%
DeepSeek V3.2 | 82.25% | Dec 1, 25 | 163.8k | ✓ | 99.99% | 79.95% | 85.01% | 81.58% | 89.46% | 95.78% | 53.75% | 72.50%
Qwen 3 32B | 82.21% | Apr 28, 25 | 41k | – | 97.43% | 81.30% | 84.61% | 81.66% | 86.35% | 89.95% | 46.83% | 89.56%
DeepSeek V4 Flash | 82.02% | Apr 24, 26 | 1m | – | 95.80% | 83.42% | 88.50% | 83.26% | 79.16% | 93.25% | 57.32% | 75.47%
DeepSeek V3 (2025-03-24) | 81.99% | Mar 24, 25 | 163.8k | – | 93.53% | 82.34% | 86.42% | 80.62% | 88.45% | 89.57% | 67.94% | 67.07%
GPT-5.4 Nano (Reasoning) | 81.36% | Mar 17, 26 | 400k | ✓ | 95.71% | 80.97% | 83.99% | 93.34% | 88.48% | 83.32% | 27.15% | 97.95%
Gemini 2.5 Flash Lite | 81.08% | Jul 22, 25 | 1m | – | 96.60% | 75.05% | 82.75% | 80.14% | 85.80% | 92.13% | 59.96% | 76.17%
Gemini 2.5 Flash | 80.60% | Jun 17, 25 | 1m | – | 99.96% | 77.57% | 86.23% | 61.45% | 92.60% | 97.83% | 57.47% | 71.70%
Mistral Large | 80.15% | Feb 26, 24 | 32k | – | 98.67% | 82.02% | 88.64% | 73.04% | 76.31% | 95.14% | 49.87% | 77.50%
Qwen3 235B A22B Instruct 2507 | 80.10% | Jul 21, 25 | 262.1k | – | 99.23% | 84.81% | 60.83% | 83.15% | 85.82% | 91.75% | 65.42% | 69.82%
Writer: Palmyra X5 | 79.57% | Jan 26, 26 | 1m | – | 99.34% | 83.95% | 56.58% | 79.71% | 86.57% | 91.20% | 67.19% | 72.04%
Inception Mercury | 79.50% | Jun 25, 25 | 128k | – | 97.98% | 69.99% | 80.37% | 87.38% | 85.96% | 79.53% | 39.68% | 95.08%
GPT-5.4 Nano (Reasoning, Low) | 79.48% | Mar 17, 26 | 400k | ✓ | 89.75% | 80.93% | 81.87% | 91.42% | 78.93% | 82.23% | 31.65% | 99.03%
GPT-4o Mini (temp=1) | 79.08% | Jul 18, 24 | 128k | – | 99.26% | 74.37% | 77.50% | 82.16% | 80.28% | 85.78% | 56.50% | 76.78%
Mistral Small 3.2 24B | 78.58% | Jun 20, 25 | 131k | – | 99.89% | 71.87% | 72.77% | 73.17% | 81.71% | 89.48% | 64.08% | 75.64%
Gemma 3 12B | 78.41% | Mar 12, 25 | 128k | – | 97.69% | 75.38% | 80.10% | 79.28% | 79.42% | 85.18% | 61.05% | 69.15%
Llama 3.1 70B | 78.40% | Jul 23, 24 | 128k | – | 88.59% | 72.78% | 80.18% | 81.03% | 79.31% | 92.10% | 63.45% | 69.78%
GPT-4o Mini (temp=0) | 78.29% | Jul 18, 24 | 128k | – | 98.54% | 73.10% | 75.00% | 81.43% | 81.26% | 84.62% | 58.84% | 73.56%
Gemma 3 27B | 77.85% | Mar 12, 25 | 128k | – | 99.88% | 78.79% | 77.21% | 76.82% | 86.74% | 86.63% | 47.98% | 68.74%
Mistral Medium 3.1 | 77.83% | Aug 13, 25 | 131k | – | 97.50% | 81.70% | 49.50% | 80.13% | 89.32% | 93.77% | 48.60% | 82.09%
Nemotron 3 Nano | 77.73% | Dec 14, 25 | 262k | – | 83.98% | 65.87% | 87.63% | 86.00% | 89.91% | 75.81% | 43.47% | 89.16%
Mistral Small 4 | 76.46% | Mar 16, 26 | 265k | – | 95.02% | 81.12% | 51.96% | 78.28% | 78.72% | 91.00% | 62.17% | 73.41%
Qwen 2.5 72B | 75.46% | Sep 19, 24 | 131.1k | – | 99.38% | 75.16% | 68.95% | 76.43% | 83.43% | 89.18% | 31.55% | 79.56%
Llama 3.1 Nemotron 70B | 74.70% | Oct 15, 24 | 128k | – | 95.74% | 71.71% | 46.80% | 88.31% | 82.19% | 87.26% | 50.62% | 74.99%
GPT-5.4 Nano | 74.40% | Mar 17, 26 | 400k | – | 90.22% | 80.50% | 80.82% | 78.57% | 75.66% | 79.22% | 20.94% | 89.29%
Arcee AI: Trinity Large (Preview) | 73.33% | Jan 27, 26 | 128k | – | 93.42% | 75.26% | 78.38% | 60.74% | 77.24% | 86.62% | 38.52% | 76.47%
ByteDance Seed 1.6 Flash | 73.27% | Dec 23, 25 | 256k | ✓ | 39.46% | 81.51% | 61.23% | 84.16% | 86.52% | 91.64% | 47.15% | 94.52%
Mistral Small Creative | 73.27% | Dec 16, 25 | 32k | – | 86.85% | 80.29% | 41.85% | 76.28% | 87.99% | 90.31% | 48.15% | 74.46%
Hermes 3 70B | 72.57% | Aug 15, 24 | 128k | – | 97.86% | 77.41% | 81.66% | 61.15% | 79.08% | 63.34% | 53.00% | 67.03%
Ministral 3 14B | 72.54% | Dec 2, 25 | 262k | – | 91.91% | 79.11% | 30.00% | 79.03% | 83.24% | 86.20% | 50.83% | 79.99%
GPT-4.1 Nano | 71.94% | Apr 14, 25 | 1m | – | 81.37% | 71.81% | 78.95% | 68.45% | 70.24% | 76.06% | 40.88% | 87.73%
Ministral 3 8B | 71.76% | Dec 2, 25 | 262k | – | 99.42% | 77.26% | 48.96% | 74.43% | 71.64% | 78.52% | 31.34% | 92.52%
Claude 3 Haiku | 71.19% | Mar 13, 24 | 200k | – | 99.47% | 74.53% | 72.76% | 68.47% | 77.94% | 64.36% | 51.15% | 60.81%
WizardLM 2 8x22b | 71.06% | Apr 15, 24 | 65k | – | 90.27% | 79.06% | 78.05% | 67.14% | 67.36% | 88.13% | 28.27% | 70.19%
Arcee AI: Trinity Mini | 70.90% | Dec 1, 25 | 128k | ✓ | 97.16% | 74.01% | 70.59% | 59.94% | 76.94% | 73.88% | 23.57% | 91.12%
Cohere Command R+ (Aug. 2024) | 69.03% | Aug 31, 24 | 128k | – | 91.84% | 77.70% | 66.58% | 59.51% | 65.10% | 68.40% | 58.70% | 64.40%
Gemma 3 4B | 68.57% | Mar 12, 25 | 128k | – | 97.88% | 72.10% | 72.28% | 60.30% | 73.64% | 78.38% | 26.37% | 67.60%
Ministral 3 3B | 67.22% | Dec 2, 25 | 131k | – | 93.79% | 75.45% | 68.10% | 72.38% | 71.88% | 69.80% | 15.87% | 70.45%
Mistral NeMO | 65.04% | Jul 18, 24 | 128k | – | 83.21% | 76.72% | 80.80% | 51.55% | 57.59% | 73.69% | 34.11% | 62.63%
Ministral 8B | 64.87% | Oct 16, 24 | 128k | – | 85.58% | 76.87% | 53.91% | 46.82% | 73.78% | 77.52% | 15.27% | 89.19%
Llama 3.1 8B | 63.35% | Jul 23, 24 | 128k | – | 69.49% | 76.54% | 64.06% | 74.82% | 69.12% | 75.45% | 34.03% | 43.27%
Ministral 3B | 61.29% | Oct 16, 24 | 128k | – | 87.64% | 75.49% | 42.25% | 49.17% | 69.70% | 70.91% | 24.45% | 70.75%
LFM2 24B | 58.77% | Feb 25, 26 | 32k | – | 16.85% | 78.10% | 64.64% | 69.48% | 54.88% | 71.56% | 24.12% | 90.53%
Rocinante 12B | 54.54% | Sep 30, 24 | 32k | – | 38.30% | 81.94% | 63.45% | 48.47% | 54.31% | 56.31% | 41.51% | 52.07%
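The page does not state how the Total column is derived. For the rows spot-checked here, it appears consistent with an unweighted mean of the eight category scores, with missing categories (shown as a dash) left out of the average. The Python sketch below reproduces that check for two rows from the table; the function name `total_score` and the use of `None` for a missing score are illustrative choices, not part of the benchmark.

```python
from statistics import mean

def total_score(category_scores):
    """Unweighted mean of the available category scores, rounded to 2 decimals.

    Assumption: missing categories are represented as None and skipped.
    """
    available = [s for s in category_scores if s is not None]
    return round(mean(available), 2)

# Category order: Tooling, Creative Writing, Language, Utility,
# Reasoning, Text Editing, Rule Following, Hallucination.
opus_4_6_reasoning = [100.00, 84.55, 96.12, 98.93, 93.77, 98.86, 89.78, 98.13]
aurora_alpha = [99.69, 67.54, 92.50, 92.59, 90.11, None, 44.19, 99.93]

print(total_score(opus_4_6_reasoning))  # 95.02, matching the Total column
print(total_score(aurora_alpha))        # 83.79, matching the Total column
```

Because the table shows rounded category scores, a recomputed total can differ from the published one by 0.01 for some rows.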
Model Performance

Cost vs Performance

Compares total benchmark cost against overall score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
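As a rough illustration of the median-based quadrants, the Python sketch below splits a set of (cost, score) points at the median of each axis. The model names and cost figures are placeholders, since no cost data appears in this text, and the handling of points that fall exactly on a median line is an assumption.

```python
from statistics import median

def quadrants(points):
    """Assign each (name, cost, score) point to a quadrant around the medians.

    Assumption: a point exactly on a median line counts as low cost
    or high score respectively; the page does not say how ties are drawn.
    """
    cost_mid = median(cost for _, cost, _ in points)
    score_mid = median(score for _, _, score in points)
    labels = {}
    for name, cost, score in points:
        cost_side = "low cost" if cost <= cost_mid else "high cost"
        score_side = "high score" if score >= score_mid else "low score"
        labels[name] = f"{cost_side}, {score_side}"
    return cost_mid, score_mid, labels

# Placeholder data: (name, total benchmark cost in USD, overall score in %).
sample = [
    ("model-a", 1.0, 90.0),
    ("model-b", 2.0, 80.0),
    ("model-c", 3.0, 85.0),
    ("model-d", 4.0, 95.0),
]

cost_mid, score_mid, labels = quadrants(sample)
print(cost_mid, score_mid)  # 2.5 87.5
for name, label in labels.items():
    print(name, "->", label)
```

Using medians rather than fixed thresholds means the quadrant lines move whenever models are added or removed, so a model's quadrant is relative to the current set.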

Six low-scoring outliers are hidden from the chart: Mistral NeMO (65.0%), Ministral 8B (64.9%), Llama 3.1 8B (63.3%), Ministral 3B (61.3%), LFM2 24B (58.8%), Rocinante 12B (54.5%).

Cost Breakdown

Total benchmark cost per model, broken down by input, reasoning, and output tokens. Toggle between USD and token views. Only models with available cost data are shown.
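A minimal Python sketch of such a breakdown follows, assuming per-million-token pricing and assuming that reasoning tokens are billed at the output rate. Neither assumption is confirmed by this page, and the `Usage` type, the prices, and the token counts are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    """Token counts for one full benchmark run (hypothetical structure)."""
    input_tokens: int
    reasoning_tokens: int
    output_tokens: int

def cost_breakdown(usage, input_price_per_m, output_price_per_m):
    """Split the run cost in USD into input, reasoning, and output parts.

    Assumption: reasoning tokens are charged at the output-token price.
    """
    per_token_in = input_price_per_m / 1_000_000
    per_token_out = output_price_per_m / 1_000_000
    parts = {
        "input": usage.input_tokens * per_token_in,
        "reasoning": usage.reasoning_tokens * per_token_out,
        "output": usage.output_tokens * per_token_out,
    }
    parts["total"] = parts["input"] + parts["reasoning"] + parts["output"]
    return parts

# Placeholder numbers: 2M input, 1M reasoning, 0.5M output tokens,
# priced at $1 per million input tokens and $4 per million output tokens.
run = Usage(input_tokens=2_000_000, reasoning_tokens=1_000_000, output_tokens=500_000)
print(cost_breakdown(run, input_price_per_m=1.0, output_price_per_m=4.0))
```

In this made-up example the reasoning tokens account for half of the total cost, which is the kind of effect a per-component breakdown makes visible.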