Models

The following models are tested. Some models appear more than once with different parameters (for example reasoning effort or temperature) to show how those settings affect the results.

Model | Total | Released | Context | CoT | Tooling | Creative Writing | Language | Utility | Reasoning | Text Editing | Rule Following | Hallucination
Claude Opus 4.6 (Reasoning) | 95.06% | Feb 4, 26 | 1m | ✓ | 100.00% | 84.55% | 96.12% | 98.93% | 93.19% | 98.86% | 89.78% | 99.06%
Qwen3.7 Max | 94.55% | May 21, 26 | 1m | ✓ | 100.00% | 85.39% | 97.05% | 99.54% | 83.47% | 98.08% | 95.76% | 97.15%
Gemini 3.1 Pro (Preview) | 94.08% | Feb 19, 26 | 1m | ✓ | 99.92% | 85.44% | 94.90% | 99.91% | 88.20% | 98.51% | 91.21% | 94.53%
GPT-5.4 (Reasoning) | 93.85% | Mar 5, 26 | 1m | ✓ | 100.00% | 91.17% | 94.90% | 96.89% | 94.89% | 98.42% | 79.29% | 95.22%
Z.AI GLM 5.1 | 93.74% | Apr 7, 26 | 200k | ✓ | 100.00% | 84.05% | 91.57% | 97.51% | 90.64% | 98.90% | 88.41% | 98.84%
Qwen3.6 Max Preview | 93.72% | Apr 27, 26 | 262.1k | ✓ | 100.00% | 88.42% | 100.00% | 98.34% | 85.79% | 98.58% | 82.79% | 95.86%
GPT-5.5 (Reasoning) | 93.72% | Apr 24, 26 | 1m | ✓ | 100.00% | 90.26% | 99.69% | 96.60% | 92.87% | 98.79% | 79.40% | 92.12%
Claude Sonnet 4.6 (Reasoning) | 93.64% | Feb 17, 26 | 1m | ✓ | 100.00% | 83.09% | 97.58% | 97.88% | 89.61% | 98.30% | 85.73% | 96.98%
Gemini 3.5 Flash (Reasoning) | 93.35% | May 19, 26 | 1m | ✓ | 100.00% | 79.87% | 94.41% | 98.86% | 88.22% | 97.78% | 92.04% | 95.61%
Z.AI GLM 5 Turbo | 93.29% | Mar 15, 26 | 200k | ✓ | 94.30% | 84.66% | 99.90% | 96.36% | 86.87% | 98.17% | 86.78% | 99.32%
MoonshotAI: Kimi K2.6 | 92.57% | Apr 20, 26 | 256k | ✓ | 99.81% | 85.47% | 96.77% | 97.42% | 91.69% | 98.00% | 76.94% | 94.43%
Claude Opus 4.7 (Reasoning) | 92.53% | Apr 16, 26 | 1m | ✓ | 100.00% | 84.73% | 98.77% | 97.87% | 88.52% | 97.58% | 74.04% | 98.69%
GPT-5.5 (Reasoning, Low) | 92.51% | Apr 24, 26 | 1m | ✓ | 98.50% | 90.24% | 99.24% | 96.36% | 88.87% | 98.59% | 76.90% | 91.37%
Claude Opus 4.8 (Reasoning) | 92.33% | May 27, 26 | 1m | ✓ | 99.53% | 85.25% | 96.38% | 99.26% | 91.76% | 98.78% | 70.27% | 97.41%
Claude Opus 4.6 | 92.31% | Feb 4, 26 | 1m | – | 100.00% | 83.59% | 96.13% | 90.72% | 89.75% | 98.35% | 83.11% | 96.80%
Claude Opus 4.8 (Reasoning, Low) | 91.89% | May 27, 26 | 1m | ✓ | 99.48% | 85.86% | 96.31% | 98.00% | 88.83% | 98.71% | 70.56% | 97.33%
GPT-5 | 91.48% | Aug 7, 25 | 400k | ✓ | 98.67% | 86.87% | 91.50% | 93.53% | 89.34% | 98.90% | 77.13% | 95.92%
GPT-5 Mini | 91.31% | Aug 7, 25 | 400k | ✓ | 99.99% | 80.48% | 96.49% | 98.39% | 83.31% | 97.13% | 76.44% | 98.28%
Qwen 3.5 397B A17B | 91.09% | Feb 15, 26 | 128k | ✓ | 99.81% | 86.93% | 95.01% | 97.50% | 81.97% | 98.05% | 79.39% | 90.04%
Grok 4.3 (Reasoning) | 90.99% | Apr 30, 26 | 1m | ✓ | 98.47% | 85.11% | 97.50% | 92.94% | 75.64% | 97.64% | 82.80% | 97.86%
Grok 4.20 (Beta, Reasoning) | 90.98% | Mar 12, 26 | 2m | ✓ | 100.00% | 84.50% | 99.08% | 95.41% | 76.73% | 98.69% | 75.31% | 98.14%
GPT-5.4 (Reasoning, Low) | 90.91% | Mar 5, 26 | 1m | ✓ | 99.97% | 90.51% | 90.79% | 95.32% | 86.99% | 98.01% | 70.02% | 95.71%
Grok 4.20 (Reasoning) | 90.87% | Mar 31, 26 | 2m | ✓ | 100.00% | 86.25% | 96.61% | 92.61% | 74.28% | 98.83% | 82.04% | 96.30%
MoonshotAI: Kimi K2.5 | 90.86% | Jan 27, 26 | 262k | ✓ | 100.00% | 81.35% | 97.10% | 96.63% | 87.99% | 97.79% | 72.03% | 94.01%
GPT-5.1 | 90.73% | Nov 13, 25 | 400k | ✓ | 98.33% | 87.20% | 93.64% | 95.33% | 82.57% | 98.54% | 74.05% | 96.22%
Claude Sonnet 4.6 | 90.66% | Feb 17, 26 | 1m | – | 100.00% | 83.31% | 100.00% | 88.52% | 79.62% | 96.37% | 82.50% | 94.99%
MiniMax M3 | 90.45% | May 31, 26 | 1m | ✓ | 99.96% | 84.57% | 94.71% | 93.59% | 86.27% | 97.51% | 69.18% | 97.83%
Qwen 3.5 122B | 90.32% | Feb 25, 26 | 262k | ✓ | 99.33% | 83.02% | 95.01% | 96.36% | 79.24% | 96.31% | 80.00% | 93.29%
Qwen 3.5 27B | 90.05% | Feb 25, 26 | 262k | ✓ | 99.17% | 82.54% | 95.52% | 95.67% | 79.44% | 98.69% | 76.04% | 93.29%
Gemini 3 Flash (Preview, Reasoning) | 89.93% | Dec 17, 25 | 1m | ✓ | 100.00% | 75.87% | 94.93% | 97.20% | 86.16% | 98.12% | 74.48% | 92.65%
Claude Opus 4.7 | 89.90% | Apr 16, 26 | 1m | – | 99.48% | 84.74% | 92.32% | 95.77% | 88.29% | 97.55% | 68.08% | 92.95%
GPT-5.4 Mini (Reasoning) | 89.82% | Mar 17, 26 | 400k | ✓ | 100.00% | 88.66% | 98.12% | 94.44% | 86.10% | 95.78% | 57.38% | 98.11%
Qwen 3.5 Plus (2026-04-20) | 89.79% | Apr 20, 26 | 1m | – | 95.33% | 85.18% | 97.14% | 96.42% | 80.60% | 97.70% | 67.53% | 98.38%
Gemma 4 31B (Reasoning) | 89.64% | Apr 3, 26 | 256k | ✓ | 98.50% | 78.13% | 83.82% | 96.32% | 80.14% | 98.83% | 85.00% | 96.37%
Claude Opus 4.5 | 89.60% | Nov 24, 25 | 200k | – | 100.00% | 81.71% | 99.66% | 89.84% | 84.25% | 97.69% | 72.61% | 91.03%
Z.AI GLM 5 | 89.60% | Feb 11, 26 | 200k | ✓ | 98.50% | 83.63% | 92.06% | 94.11% | 84.09% | 98.59% | 67.78% | 98.04%
ByteDance Seed 1.6 | 89.59% | Dec 23, 25 | 256k | ✓ | 98.33% | 78.43% | 95.63% | 90.83% | 79.79% | 98.40% | 77.71% | 97.57%
Grok 4.1 Fast | 89.55% | Nov 19, 25 | 2m | ✓ | 100.00% | 82.14% | 88.76% | 84.12% | 93.58% | 97.87% | 70.87% | 99.02%
GPT-5.2 | 89.45% | Dec 10, 25 | 400k | – | 100.00% | 80.36% | 91.19% | 96.22% | 85.61% | 97.54% | 67.10% | 97.55%
GPT-5.5 | 89.37% | Apr 24, 26 | 1m | – | 100.00% | 90.39% | 94.15% | 81.88% | 87.68% | 98.20% | 72.34% | 90.28%
Qwen 3.6 Flash | 89.31% | Apr 27, 26 | 1m | ✓ | 99.78% | 86.02% | 89.33% | 96.09% | 79.51% | 96.09% | 71.50% | 96.13%
DeepSeek V4 Pro (Reasoning) | 89.28% | Apr 24, 26 | 1m | ✓ | 99.40% | 82.99% | 88.51% | 93.24% | 83.52% | 98.56% | 72.74% | 95.25%
Gemma 4 26B (Reasoning) | 89.02% | Apr 3, 26 | 256k | ✓ | 99.04% | 76.38% | 95.20% | 95.69% | 74.05% | 98.26% | 74.75% | 98.79%
Gemini 3 Pro (Preview) | 88.79% | Nov 18, 25 | 1m | ✓ | 99.98% | 77.77% | 89.64% | 96.14% | 95.24% | 98.86% | 64.47% | 88.23%
o4 Mini High | 88.78% | Apr 16, 25 | 200k | ✓ | 100.00% | 82.72% | 79.76% | 98.67% | 82.51% | 94.36% | 72.70% | 99.53%
Gemini 2.5 Pro | 88.44% | Jun 17, 25 | 1m | ✓ | 100.00% | 81.03% | 92.57% | 92.18% | 89.21% | 98.58% | 60.89% | 93.05%
Qwen 3.6 27B | 88.33% | Apr 27, 26 | 262.1k | ✓ | 97.91% | 82.81% | 89.01% | 94.32% | 79.84% | 93.97% | 71.37% | 97.42%
Grok 4 | 88.12% | Jul 9, 25 | 256k | ✓ | 99.99% | 77.34% | 90.61% | 89.67% | 96.01% | 98.76% | 63.09% | 89.45%
DeepSeek V4 Flash (Reasoning) | 88.06% | Apr 24, 26 | 1m | ✓ | 96.17% | 83.03% | 94.76% | 87.53% | 85.45% | 96.15% | 64.50% | 96.92%
Z.AI GLM 4.7 | 87.67% | Dec 22, 25 | 200k | ✓ | 98.67% | 78.89% | 85.46% | 94.31% | 82.43% | 98.22% | 69.16% | 94.23%
Qwen 3.6 35B | 87.66% | Apr 27, 26 | 262.1k | ✓ | 82.67% | 85.97% | 93.56% | 96.20% | 76.37% | 95.10% | 77.34% | 94.07%
Z.AI GLM 4.6 | 87.64% | Sep 30, 25 | 200k | ✓ | 97.99% | 78.86% | 96.60% | 88.58% | 81.01% | 97.78% | 65.85% | 94.42%
Claude Sonnet 4 | 87.64% | May 22, 25 | 200k | – | 100.00% | 79.21% | 91.31% | 84.02% | 75.86% | 99.13% | 81.52% | 90.04%
Claude Sonnet 4.5 | 87.54% | Sep 29, 25 | 1m | – | 97.75% | 84.19% | 92.39% | 83.78% | 78.61% | 99.02% | 76.80% | 87.78%
Stealth: Hunter Alpha | 87.34% | Mar 11, 26 | 1m | – | 99.99% | 79.18% | 93.35% | 84.63% | 91.67% | 95.53% | 63.63% | 90.78%
Claude Opus 4 | 87.22% | May 22, 25 | 200k | – | 100.00% | 83.79% | 93.01% | 88.81% | 76.72% | 97.25% | 70.37% | 87.84%
Qwen 3.5 35B | 87.01% | Feb 25, 26 | 262k | ✓ | 94.74% | 83.51% | 91.95% | 96.42% | 77.89% | 94.95% | 67.42% | 89.24%
GPT-4.1 | 86.82% | Apr 14, 25 | 1m | – | 97.71% | 81.24% | 93.91% | 90.57% | 74.40% | 94.40% | 66.78% | 95.60%
MiniMax M2.5 | 86.71% | Feb 12, 26 | 196k | ✓ | 98.27% | 81.21% | 96.05% | 90.42% | 74.33% | 96.02% | 62.69% | 94.72%
Aion 2.0 | 86.66% | Feb 23, 26 | 131k | ✓ | 95.10% | 80.24% | 96.17% | 90.91% | 78.64% | 95.34% | 63.77% | 93.10%
o4 Mini | 86.56% | Apr 16, 25 | 200k | ✓ | 100.00% | 82.04% | 80.00% | 96.31% | 80.39% | 90.61% | 64.61% | 98.54%
MiniMax M2.7 | 86.23% | Mar 18, 26 | 204.8k | ✓ | 99.32% | 81.70% | 84.80% | 95.50% | 72.76% | 92.14% | 68.90% | 94.69%
Qwen 3.5 Plus (2026-02-15) | 86.17% | Feb 15, 26 | 1m | – | 99.78% | 77.07% | 95.10% | 86.65% | 81.81% | 98.10% | 64.21% | 86.62%
Grok 4 Fast | 86.15% | Sep 19, 25 | 2m | ✓ | 99.65% | 77.03% | 84.61% | 76.76% | 94.89% | 97.26% | 67.91% | 91.09%
Xiaomi MIMO v2.5 Pro | 86.05% | Apr 22, 26 | 1m | ✓ | 100.00% | 81.08% | 87.69% | 82.62% | 79.73% | 95.96% | 64.29% | 97.02%
Stealth: Healer Alpha | 85.93% | Mar 11, 26 | 262k | – | 100.00% | 78.28% | 88.45% | 82.30% | 91.67% | 96.04% | 56.03% | 94.67%
Gemini 3.1 Flash Lite (Reasoning) | 85.91% | May 7, 26 | 1m | ✓ | 99.81% | 76.31% | 96.70% | 92.32% | 75.32% | 96.90% | 62.26% | 87.63%
Gemini 3.5 Flash (Reasoning, Minimal) | 85.88% | May 19, 26 | 1m | – | 100.00% | 78.55% | 97.50% | 83.90% | 78.93% | 98.26% | 60.38% | 89.55%
ByteDance Seed 2.0 Mini | 85.69% | Feb 26, 26 | 262k | ✓ | 99.70% | 80.11% | 90.12% | 91.88% | 79.82% | 91.08% | 58.77% | 94.07%
Qwen 3.5 Flash | 85.66% | Feb 25, 26 | 1m | ✓ | 89.39% | 83.81% | 91.94% | 96.11% | 79.34% | 92.80% | 63.19% | 88.70%
Gemini 3 Flash (Preview) | 85.47% | Dec 17, 25 | 1m | – | 98.04% | 75.04% | 95.00% | 86.39% | 80.99% | 97.54% | 65.14% | 85.61%
Gemini 3.1 Flash Lite (Preview) | 85.41% | Feb 19, 26 | 1m | – | 99.98% | 75.78% | 94.98% | 94.00% | 75.75% | 96.46% | 59.04% | 87.29%
Gemma 4 31B | 85.23% | Apr 3, 26 | 256k | – | 100.00% | 75.59% | 75.00% | 86.69% | 78.18% | 98.56% | 72.72% | 95.09%
Gemini 3.1 Flash Lite | 85.09% | May 7, 26 | 1m | – | 99.97% | 76.01% | 90.90% | 92.77% | 74.62% | 97.35% | 61.21% | 87.86%
Z.AI GLM 4.5 | 84.95% | Jul 25, 25 | 131k | – | 98.58% | 76.56% | 97.33% | 79.19% | 77.00% | 95.32% | 63.79% | 91.83%
Gemma 4 26B | 84.89% | Apr 3, 26 | 256k | – | 91.94% | 75.17% | 92.02% | 83.17% | 73.83% | 97.04% | 71.75% | 94.22%
GPT-OSS 120B | 84.81% | Aug 5, 25 | 131k | ✓ | 100.00% | 67.85% | 97.18% | 92.03% | 77.00% | 91.73% | 55.03% | 97.63%
GPT-4o, May 13th (temp=0) | 84.73% | May 13, 24 | 128k | – | 97.18% | 74.89% | 98.72% | 83.13% | 74.69% | 95.35% | 73.24% | 80.64%
GPT-5.4 | 84.31% | Mar 5, 26 | 1m | – | 98.04% | 90.94% | 81.49% | 81.95% | 80.92% | 96.73% | 58.11% | 86.32%
Mistral Large 3 | 84.29% | Dec 1, 25 | 262k | – | 97.22% | 81.21% | 92.02% | 84.91% | 75.26% | 94.09% | 64.41% | 85.18%
ByteDance Seed 2.0 Lite | 84.27% | Mar 10, 26 | 262k | ✓ | 99.93% | 82.35% | 96.80% | 92.23% | 81.47% | 95.03% | 36.85% | 89.50%
Claude 3.5 Sonnet | 84.24% | Jun 20, 24 | 200k | – | 100.00% | 78.69% | 85.62% | 76.75% | 90.30% | 96.57% | 69.67% | 76.31%
Gemini 2.5 Flash (Reasoning) | 84.14% | Jun 17, 25 | 1m | ✓ | 100.00% | 76.30% | 86.06% | 82.25% | 72.63% | 98.12% | 59.97% | 97.79%
DeepSeek-V2 Chat | 84.09% | May 6, 24 | 128k | – | 98.80% | 77.20% | 100.00% | 83.82% | 70.75% | 90.90% | 68.78% | 82.48%
Qwen 3.5 9B | 84.05% | Mar 10, 26 | 262k | ✓ | 96.84% | 84.35% | 88.18% | 94.02% | 70.90% | 85.35% | 60.98% | 91.81%
Xiaomi MIMO v2.5 | 83.95% | Apr 22, 26 | 1m | ✓ | 99.46% | 79.16% | 76.33% | 81.15% | 81.80% | 95.34% | 60.75% | 97.61%
Stealth: Aurora Alpha | 83.79% | Feb 9, 26 | 128k | – | 99.69% | 67.54% | 92.50% | 92.59% | 90.11% | – | 44.19% | 99.93%
GPT-5.4 Mini (Reasoning, Low) | 83.57% | Mar 17, 26 | 400k | ✓ | 100.00% | 87.72% | 92.45% | 88.49% | 74.50% | 92.63% | 33.99% | 98.78%
Claude 3.7 Sonnet | 83.39% | Feb 19, 25 | 200k | – | 99.32% | 76.31% | 92.95% | 62.54% | 89.94% | 97.12% | 73.78% | 75.18%
Claude Haiku 4.5 | 83.36% | Oct 15, 25 | 200k | – | 96.75% | 78.96% | 91.84% | 72.48% | 67.77% | 96.81% | 70.35% | 91.93%
Gemini 2.5 Flash Lite (Reasoning) | 83.10% | Jul 22, 25 | 1m | ✓ | 98.28% | 71.64% | 74.36% | 89.63% | 72.51% | 94.54% | 66.81% | 96.99%
GPT-4o, May 13th (temp=1) | 82.99% | May 13, 24 | 128k | – | 97.31% | 75.88% | 92.52% | 80.69% | 72.06% | 92.41% | 69.88% | 83.19%
Grok 4.20 (Beta) | 82.64% | Mar 12, 26 | 2m | – | 100.00% | 82.80% | 91.17% | 82.15% | 72.06% | 95.49% | 53.89% | 83.57%
DeepSeek V3 (2024-12-26) | 82.62% | Dec 26, 24 | 163.8k | – | 98.58% | 77.88% | 87.88% | 81.87% | 69.12% | 93.58% | 66.39% | 85.69%
DeepSeek V3.1 | 82.35% | Aug 21, 25 | 163.8k | – | 96.80% | 77.45% | 96.87% | 76.65% | 72.08% | 87.27% | 66.15% | 85.55%
DeepSeek V3.2 | 82.22% | Dec 1, 25 | 163.8k | ✓ | 99.99% | 79.95% | 85.01% | 81.58% | 75.48% | 95.78% | 53.75% | 86.25%
Z.AI GLM 4.7 Flash | 82.21% | Jan 19, 26 | 200k | ✓ | 92.53% | 77.36% | 87.67% | 88.98% | 69.68% | 85.82% | 65.63% | 90.00%
GPT-4o, Aug. 6th (temp=0) | 82.18% | Aug 6, 24 | 128k | – | 99.96% | 73.65% | 75.00% | 82.11% | 75.40% | 93.77% | 74.19% | 83.33%
DeepSeek V4 Pro | 82.05% | Apr 24, 26 | 1m | – | 99.57% | 83.70% | 72.80% | 77.57% | 71.72% | 97.98% | 63.74% | 89.36%
DeepSeek V4 Flash | 82.02% | Apr 24, 26 | 1m | – | 94.00% | 83.42% | 88.50% | 83.26% | 68.68% | 93.25% | 57.32% | 87.73%
Inception Mercury 2 | 81.99% | Mar 4, 26 | 128k | – | 98.33% | 68.31% | 87.32% | 92.86% | 76.23% | 85.26% | 54.41% | 93.22%
Mistral Large 2 | 81.50% | Jul 24, 24 | 128k | – | 97.31% | 81.86% | 85.22% | 69.19% | 75.76% | 94.16% | 63.05% | 85.45%
GPT-4.1 Mini | 81.40% | Apr 14, 25 | 1m | – | 96.94% | 74.52% | 89.64% | 82.30% | 67.35% | 95.62% | 58.59% | 86.23%
GPT-4o, Aug. 6th (temp=1) | 81.28% | Aug 6, 24 | 128k | – | 98.86% | 75.50% | 82.21% | 82.44% | 70.58% | 86.72% | 67.91% | 86.00%
Grok 4.20 | 81.21% | Mar 31, 26 | 2m | – | 95.00% | 83.44% | 78.86% | 84.11% | 72.46% | 95.63% | 59.71% | 80.48%
Hermes 3 405B | 80.80% | Aug 15, 24 | 128k | – | 96.98% | 80.92% | 99.57% | 69.02% | 64.59% | 89.14% | 59.17% | 87.02%
Z.AI GLM 4.5 Air | 80.74% | Jul 25, 25 | 131k | – | 98.75% | 74.61% | 95.05% | 76.57% | 68.46% | 94.38% | 44.11% | 94.00%
Gemini 2.5 Flash | 80.61% | Jun 17, 25 | 1m | – | 99.97% | 77.57% | 86.23% | 61.45% | 79.09% | 97.83% | 57.47% | 85.27%
GPT-5.4 Mini | 80.45% | Mar 17, 26 | 400k | – | 98.55% | 88.10% | 88.75% | 79.37% | 70.07% | 90.60% | 46.32% | 81.85%
GPT-5 Nano | 80.16% | Aug 7, 25 | 400k | ✓ | 99.34% | 67.04% | 77.18% | 93.91% | 70.50% | 82.74% | 57.57% | 92.99%
GPT-5.4 Nano (Reasoning) | 80.02% | Mar 17, 26 | 400k | ✓ | 96.43% | 80.97% | 83.99% | 93.34% | 77.96% | 83.32% | 27.15% | 96.98%
DeepSeek V3 (2025-03-24) | 79.93% | Mar 24, 25 | 163.8k | – | 86.36% | 82.34% | 86.42% | 80.62% | 69.32% | 89.57% | 67.94% | 76.88%
Gemini 2.5 Flash Lite | 79.91% | Jul 22, 25 | 1m | – | 96.00% | 75.05% | 82.75% | 80.14% | 65.19% | 92.13% | 59.96% | 88.08%
Mistral Large | 79.91% | Feb 26, 24 | 32k | – | 96.39% | 82.02% | 88.64% | 73.04% | 68.92% | 95.14% | 49.87% | 85.27%
Inception Mercury | 79.50% | Jun 25, 25 | 128k | – | 97.98% | 69.99% | 80.37% | 87.38% | 85.96% | 79.53% | 39.68% | 95.08%
Mistral Small 4 (Reasoning) | 79.48% | Mar 16, 26 | 265k | ✓ | 97.28% | 81.67% | 60.53% | 85.61% | 66.73% | 90.58% | 60.28% | 93.18%
Qwen 3 32B | 79.37% | Apr 28, 25 | 41k | – | 95.19% | 81.30% | 84.61% | 81.66% | 66.35% | 89.95% | 46.83% | 89.06%
GPT-4o Mini (temp=1) | 79.08% | Jul 18, 24 | 128k | – | 99.26% | 74.37% | 77.50% | 82.16% | 80.28% | 85.78% | 56.50% | 76.78%
Nemotron 3 Super | 78.99% | Mar 11, 26 | 262k | – | 79.58% | 69.75% | 81.41% | 95.29% | 70.62% | 86.34% | 57.43% | 91.51%
GPT-4o Mini (temp=0) | 78.29% | Jul 18, 24 | 128k | – | 98.54% | 73.10% | 75.00% | 81.43% | 81.26% | 84.62% | 58.84% | 73.56%
Writer: Palmyra X5 | 78.11% | Jan 26, 26 | 1m | – | 96.12% | 83.95% | 56.58% | 79.71% | 69.64% | 91.20% | 67.19% | 80.49%
Qwen3 235B A22B Instruct 2507 | 78.07% | Jul 21, 25 | 262.1k | – | 93.86% | 84.81% | 60.83% | 83.15% | 68.43% | 91.75% | 65.42% | 76.34%
Grok 4.3 | 78.00% | Apr 30, 26 | 1m | – | 91.21% | 84.51% | 84.74% | 66.41% | 70.17% | 90.19% | 49.02% | 87.79%
GPT-5.4 Nano (Reasoning, Low) | 77.46% | Mar 17, 26 | 400k | ✓ | 91.46% | 80.93% | 81.87% | 91.42% | 64.43% | 82.23% | 31.65% | 95.71%
Llama 3.1 70B | 77.41% | Jul 23, 24 | 128k | – | 87.82% | 72.78% | 80.18% | 81.03% | 61.45% | 92.10% | 63.45% | 80.47%
Mistral Small 3.2 24B | 77.36% | Jun 20, 25 | 131k | – | 96.82% | 71.87% | 72.77% | 73.17% | 63.87% | 89.48% | 64.08% | 86.84%
Mistral Medium 3.1 | 76.08% | Aug 13, 25 | 131k | – | 94.17% | 81.70% | 49.50% | 80.13% | 71.84% | 93.77% | 48.60% | 88.93%
Gemma 3 12B | 76.07% | Mar 12, 25 | 128k | – | 94.16% | 75.38% | 80.10% | 79.28% | 56.28% | 85.18% | 61.05% | 77.13%
Gemma 3 27B | 75.70% | Mar 12, 25 | 128k | – | 96.15% | 78.79% | 77.21% | 76.82% | 61.51% | 86.63% | 47.98% | 80.53%
Mistral Small 4 | 75.23% | Mar 16, 26 | 265k | – | 93.85% | 81.12% | 51.96% | 78.28% | 61.67% | 91.00% | 62.17% | 81.83%
Llama 3.1 Nemotron 70B | 74.70% | Oct 15, 24 | 128k | – | 95.74% | 71.71% | 46.80% | 88.31% | 82.19% | 87.26% | 50.62% | 74.99%
Nemotron 3 Nano | 74.50% | Dec 14, 25 | 262k | – | 83.65% | 65.87% | 87.63% | 86.00% | 65.05% | 75.81% | 43.47% | 88.55%
Arcee AI: Trinity Large (Preview) | 73.33% | Jan 27, 26 | 128k | – | 93.42% | 75.26% | 78.38% | 60.74% | 77.24% | 86.62% | 38.52% | 76.47%
Mistral Small Creative | 73.27% | Dec 16, 25 | 32k | – | 86.85% | 80.29% | 41.85% | 76.28% | 87.99% | 90.31% | 48.15% | 74.46%
Qwen 2.5 72B | 73.17% | Sep 19, 24 | 131.1k | – | 96.82% | 75.16% | 68.95% | 76.43% | 61.71% | 89.18% | 31.55% | 85.54%
Cydonia 24B V4.1 | 72.68% | Sep 27, 25 | 131k | – | 86.72% | 74.19% | 72.49% | 69.32% | 56.95% | 86.15% | 50.36% | 85.22%
GPT-5.4 Nano | 72.16% | Mar 17, 26 | 400k | – | 91.85% | 80.50% | 80.82% | 78.57% | 58.53% | 79.22% | 20.94% | 86.86%
WizardLM 2 8x22b | 71.45% | Apr 15, 24 | 65k | – | 91.22% | 79.06% | 78.05% | 67.14% | 60.41% | 88.13% | 28.27% | 79.28%
ByteDance Seed 1.6 Flash | 71.22% | Dec 23, 25 | 256k | ✓ | 48.22% | 81.51% | 61.23% | 84.16% | 64.29% | 91.64% | 47.15% | 91.56%
Ministral 3 14B | 70.45% | Dec 2, 25 | 262k | – | 87.84% | 79.11% | 30.00% | 79.03% | 67.43% | 86.20% | 50.83% | 83.19%
Claude 3 Haiku | 70.13% | Mar 13, 24 | 200k | – | 95.06% | 74.53% | 72.76% | 68.47% | 65.08% | 64.36% | 51.15% | 69.60%
Ministral 3 8B | 69.98% | Dec 2, 25 | 262k | – | 95.77% | 77.26% | 48.96% | 74.43% | 59.53% | 78.52% | 31.34% | 94.02%
GPT-4.1 Nano | 69.90% | Apr 14, 25 | 1m | – | 78.47% | 71.81% | 78.95% | 68.45% | 54.55% | 76.06% | 40.88% | 90.07%
Hermes 3 70B | 69.74% | Aug 15, 24 | 128k | – | 94.47% | 77.41% | 81.66% | 61.15% | 56.20% | 63.34% | 53.00% | 70.71%
Arcee AI: Trinity Mini | 67.68% | Dec 1, 25 | 128k | ✓ | 95.64% | 74.01% | 70.59% | 59.94% | 55.45% | 73.88% | 23.57% | 88.35%
Cohere Command R+ (Aug. 2024) | 67.04% | Aug 31, 24 | 128k | – | 87.53% | 77.70% | 66.58% | 59.51% | 51.60% | 68.40% | 58.70% | 66.30%
Gemma 3 4B | 66.33% | Mar 12, 25 | 128k | – | 92.40% | 72.10% | 72.28% | 60.30% | 52.86% | 78.38% | 26.37% | 75.94%
Ministral 3 3B | 65.02% | Dec 2, 25 | 131k | – | 89.41% | 75.45% | 68.10% | 72.38% | 54.06% | 69.80% | 15.87% | 75.10%
Mistral NeMO | 63.80% | Jul 18, 24 | 128k | – | 79.34% | 76.72% | 80.80% | 51.55% | 44.87% | 73.69% | 34.11% | 69.32%
Ministral 8B | 63.77% | Oct 16, 24 | 128k | – | 84.65% | 76.87% | 53.91% | 46.82% | 63.20% | 77.52% | 15.27% | 91.89%
Skyfall 36B V2 | 63.65% | Mar 10, 25 | 33k | – | 67.04% | 83.32% | 73.94% | 52.53% | 47.52% | 76.69% | 41.44% | 66.73%
Llama 3.1 8B | 61.44% | Jul 23, 24 | 128k | – | 68.57% | 76.54% | 64.06% | 74.82% | 47.82% | 75.45% | 34.03% | 50.23%
Ministral 3B | 59.25% | Oct 16, 24 | 128k | – | 84.70% | 75.49% | 42.25% | 49.17% | 53.21% | 70.91% | 24.45% | 73.79%
LFM2 24B | 57.93% | Feb 25, 26 | 32k | – | 15.71% | 78.10% | 64.64% | 69.48% | 52.88% | 71.56% | 24.12% | 86.93%
Rocinante 12B | 54.02% | Sep 30, 24 | 32k | – | 39.83% | 81.94% | 63.45% | 48.47% | 45.25% | 56.31% | 41.51% | 55.39%
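
The text does not say how the Total column is computed. In the rows checked, it is consistent with an unweighted mean of the available category scores, with a missing category (shown as "–") skipped rather than counted as zero. The Python sketch below illustrates that assumption only; the function name `total_score` is made up for this example and is not taken from the benchmark.

```python
from statistics import mean

def total_score(category_scores):
    """Unweighted mean of the available category scores, in percent.

    Assumption (not stated by the source): a missing category,
    passed here as None, is skipped instead of counted as zero.
    """
    available = [s for s in category_scores if s is not None]
    return round(mean(available), 2)

# Claude Opus 4.6 (Reasoning): all eight categories present.
opus = [100.00, 84.55, 96.12, 98.93, 93.19, 98.86, 89.78, 99.06]
# Stealth: Aurora Alpha: the Text Editing score is missing.
aurora = [99.69, 67.54, 92.50, 92.59, 90.11, None, 44.19, 99.93]

print(total_score(opus))    # 95.06
print(total_score(aurora))  # 83.79
```

Both results match the Total values listed for those two rows.
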
Model Performance
Cost vs Performance

Compares total benchmark cost against overall score. Quadrant lines are drawn at the median values. Only models with available cost data are shown.
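
A minimal sketch of how median quadrant lines could split such a chart, written in Python. The model names and cost figures are hypothetical placeholders, not benchmark data, and the tie-breaking rule (a value exactly on a median line counts as low cost or high score) is an assumption, since the source does not specify one.

```python
from statistics import median

def assign_quadrants(models):
    """Label each (name, cost_usd, score) tuple relative to the medians.

    Assumption: values exactly on a median line count as
    "low cost" and "high score" respectively.
    """
    cost_median = median(cost for _, cost, _ in models)
    score_median = median(score for _, _, score in models)
    quadrants = {}
    for name, cost, score in models:
        cost_label = "low cost" if cost <= cost_median else "high cost"
        score_label = "high score" if score >= score_median else "low score"
        quadrants[name] = f"{cost_label}, {score_label}"
    return quadrants

# Hypothetical models with made-up costs and scores.
sample = [
    ("A", 1.0, 90.0),
    ("B", 4.0, 92.0),
    ("C", 2.0, 70.0),
    ("D", 8.0, 75.0),
]

for name, label in assign_quadrants(sample).items():
    print(name, "->", label)
```

With this sample the median cost is 3.0 and the median score is 82.5, so A lands in the low-cost, high-score quadrant and D in the high-cost, low-score one.
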

7 low-scoring outliers hidden: Mistral NeMO (63.8%), Ministral 8B (63.8%), Skyfall 36B V2 (63.7%), Llama 3.1 8B (61.4%), Ministral 3B (59.2%), LFM2 24B (57.9%), Rocinante 12B (54.0%).

Cost Breakdown

Total benchmark cost per model, broken down by input, reasoning, and output tokens. The chart can be viewed in USD or in tokens. Only models with available cost data are shown.
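
As an illustration of such a breakdown, the Python sketch below converts token counts into USD per component. Everything in it is hypothetical: the `Usage` and `Prices` types, the token counts, the prices, and in particular the assumption that reasoning tokens are billed at the output-token rate. The source does not describe its pricing inputs.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    input_tokens: int
    reasoning_tokens: int
    output_tokens: int

@dataclass
class Prices:
    # Hypothetical prices in USD per one million tokens.
    input_per_million: float
    output_per_million: float

def cost_breakdown(usage, prices):
    """Return the USD cost per component plus the total.

    Assumption: reasoning tokens are billed at the output rate.
    """
    per_token_in = prices.input_per_million / 1_000_000
    per_token_out = prices.output_per_million / 1_000_000
    parts = {
        "input": usage.input_tokens * per_token_in,
        "reasoning": usage.reasoning_tokens * per_token_out,
        "output": usage.output_tokens * per_token_out,
    }
    parts["total"] = parts["input"] + parts["reasoning"] + parts["output"]
    return parts

# Made-up usage and prices, for illustration only.
usage = Usage(input_tokens=2_000_000, reasoning_tokens=1_000_000, output_tokens=500_000)
prices = Prices(input_per_million=1.0, output_per_million=4.0)
breakdown = cost_breakdown(usage, prices)
print(breakdown)
```

The token view corresponds to the raw `Usage` fields, and the USD view to the dictionary returned by `cost_breakdown`.
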