Parse dialogue

Test: Language Writing

Avg. Score: 85.3%
Scenarios: 5

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Stealth: Aurora Alpha | 100.0% |  | 2.0s | 100%
2 | GPT-4o Mini (temp=0) | 100.0% | $0.0003 | 4.8s | 100%
3 | GPT-4o Mini (temp=1) | 100.0% | $0.0003 | 5.6s | 100%
4 | Inception Mercury 2 | 99.6% | $0.0006 | 1.4s | 96%
5 | Gemini 3 Flash (Preview) | 100.0% | $0.0020 | 5.6s | 100%
6 | GPT-5.4 Mini (Reasoning, Low) | 99.9% | $0.0022 | 3.5s | 99%
7 | GPT-4.1 Mini | 99.3% | $0.0004 | 3.4s | 93%
8 | DeepSeek-V2 Chat | 100.0% | $0.0001 | 16.1s | 100%
9 | GPT-4o, Aug. 6th (temp=0) | 100.0% | $0.0052 | 6.1s | 100%
10 | Z.AI GLM 4.5 | 99.7% | $0.0013 | 14.5s | 97%
11 | Z.AI GLM 5 Turbo | 99.8% | $0.0037 | 14.7s | 98%
12 | Hermes 3 405B | 99.1% | $0.0000 | 21.0s | 94%
13 | GPT-5.4 Mini | 97.5% | $0.0020 | 2.5s | 87%
14 | GPT-4o, Aug. 6th (temp=1) | 99.4% | $0.0056 | 6.5s | 94%
15 | o4 Mini | 100.0% | $0.0071 | 16.7s | 100%
16 | Nemotron 3 Super | 97.8% | $0.0000 | 21.7s | 91%
17 | GPT-5.4 Nano (Reasoning) | 98.0% | $0.0016 | 6.1s | 83%
18 | Grok 4.20 (Beta) | 97.3% | $0.0034 | 3.3s | 84%
19 | Gemini 2.5 Flash | 97.5% | $0.0026 | 6.0s | 83%
20 | Claude Sonnet 4.6 | 100.0% | $0.010 | 13.7s | 100%
21 | DeepSeek V3.1 | 98.7% | $0.0007 | 22.8s | 88%
22 | Inception Mercury | 95.7% | $0.0002 | 1.5s | 74%
23 | Nemotron 3 Nano | 95.3% | $0.0002 | 10.8s | 77%
24 | GPT-4.1 | 97.8% | $0.0041 | 6.8s | 79%
25 | GPT-4o, May 13th (temp=0) | 97.4% | $0.0087 | 8.0s | 89%
26 | Gemini 3.1 Flash Lite (Preview) | 95.0% | $0.0011 | 3.7s | 72%
27 | MiniMax M2.5 | 97.1% | $0.0013 | 21.5s | 81%
28 | Gemini 2.5 Flash (Reasoning) | 97.1% | $0.0067 | 13.1s | 86%
29 | GPT-5.4 Mini (Reasoning) | 96.2% | $0.0046 | 7.3s | 77%
30 | GPT-4.1 Nano | 92.9% | $0.0001 | 4.0s | 69%
31 | GPT-5.4 Nano (Reasoning, Low) | 93.7% | $0.0018 | 7.0s | 69%
32 | Gemini 3 Flash (Preview, Reasoning) | 94.9% | $0.0051 | 11.8s | 74%
33 | Claude Opus 4.5 | 99.3% | $0.019 | 15.7s | 96%
34 | Claude 3.5 Sonnet | 96.2% | $0.0095 | 15.7s | 80%
35 | Stealth: Hunter Alpha | 91.7% | $0.0000 | 22.8s | 69%
36 | GPT-5.4 | 98.0% | $0.013 | 19.0s | 85%
37 | Claude Haiku 4.5 | 88.7% | $0.0037 | 7.8s | 67%
38 | GPT-5 Nano | 99.4% | $0.0028 | 1.2m | 96%
39 | GPT-5 Mini | 98.0% | $0.0066 | 37.2s | 82%
40 | Claude 3.7 Sonnet | 95.9% | $0.012 | 14.8s | 80%
41 | o4 Mini High | 99.5% | $0.015 | 36.6s | 97%
42 | GPT-5.4 (Reasoning, Low) | 96.6% | $0.012 | 16.1s | 77%
43 | GPT-5.4 Nano | 91.6% | $0.0017 | 6.6s | 55%
44 | Claude Sonnet 4 | 92.6% | $0.0092 | 11.1s | 71%
45 | Grok 4.1 Fast | 87.5% | $0.0007 | 15.1s | 61%
46 | Gemma 3 27B | 89.4% | $0.0003 | 27.9s | 64%
47 | Grok 4 Fast | 84.2% | $0.0005 | 7.6s | 56%
48 | Gemini 2.5 Flash Lite | 85.5% | $0.0005 | 5.1s | 53%
49 | Gemma 3 12B | 90.2% | $0.0002 | 33.9s | 66%
50 | Qwen 3.5 Plus (2026-02-15) | 90.2% | $0.0019 | 27.2s | 65%
51 | Llama 3.1 70B | 85.4% | $0.0007 | 20.5s | 62%
52 | Stealth: Healer Alpha | 86.9% | $0.0000 | 16.0s | 56%
53 | Mistral Large | 87.3% | $0.0032 | 9.6s | 57%
54 | Hermes 3 70B | 88.3% | $0.0003 | 16.6s | 54%
55 | GPT-5.2 | 97.4% | $0.016 | 26.0s | 82%
56 | Mistral Large 3 | 84.0% | $0.0014 | 17.3s | 59%
57 | Aion 2.0 | 92.3% | $0.0030 | 41.5s | 69%
58 | Claude 3 Haiku | 80.5% | $0.0007 | 3.8s | 50%
59 | Grok 4.20 (Beta, Reasoning) | 98.2% | $0.023 | 18.2s | 82%
60 | Claude Sonnet 4.5 | 89.8% | $0.011 | 13.2s | 62%
61 | ByteDance Seed 1.6 Flash | 82.5% | $0.0006 | 14.6s | 48%
62 | Grok 4 | 96.2% | $0.016 | 31.8s | 74%
63 | Claude 3.5 Haiku | 74.2% | $0.0015 | 5.7s | 51%
64 | Qwen 3.5 Flash | 88.9% | $0.0019 | 42.8s | 60%
65 | GPT-5.4 (Reasoning) | 94.8% | $0.017 | 30.6s | 74%
66 | ByteDance Seed 1.6 | 91.3% | $0.0040 | 50.4s | 65%
67 | GPT-4o, May 13th (temp=1) | 90.0% | $0.0092 | 8.2s | 46%
68 | Arcee AI: Trinity Large (Preview) | 76.8% | $0.0000 | 11.2s | 38%
69 | Qwen 3.5 122B | 90.0% | $0.013 | 35.0s | 65%
70 | Gemini 2.5 Flash Lite (Reasoning) | 83.7% | $0.0019 | 23.9s | 41%
71 | Qwen 3.5 27B | 91.0% | $0.0096 | 48.7s | 63%
72 | Arcee AI: Trinity Mini | 81.2% | $0.0002 | 15.9s | 32%
73 | MoonshotAI: Kimi K2.5 | 94.2% | $0.0082 | 1.2m | 68%
74 | Llama 3.1 8B | 73.1% | $0.0001 | 4.6s | 26%
75 | Claude Sonnet 4.6 (Reasoning) | 95.2% | $0.025 | 31.1s | 72%
76 | Z.AI GLM 4.6 | 93.2% | $0.0044 | 1.2m | 60%
77 | Gemma 3 4B | 74.6% | $0.0001 | 13.4s | 29%
78 | ByteDance Seed 2.0 Lite | 93.6% | $0.0069 | 1.3m | 67%
79 | Gemini 2.5 Pro | 95.1% | $0.024 | 22.5s | 61%
80 | DeepSeek V3.2 | 90.0% | $0.0004 | 1.1m | 45%
81 | Qwen 3.5 35B | 83.9% | $0.010 | 33.8s | 49%
82 | Z.AI GLM 5 | 89.1% | $0.0059 | 1.3m | 63%
83 | Cohere Command R+ (Aug. 2024) | 73.2% | $0.0062 | 13.4s | 38%
84 | GPT-5.1 | 97.3% | $0.027 | 52.6s | 81%
85 | Z.AI GLM 4.7 Flash | 80.3% | $0.0012 | 52.0s | 45%
86 | Qwen 3 32B | 79.2% | $0.0006 | 35.4s | 33%
87 | Ministral 3B | 59.5% | $0.0000 | 2.9s | 30%
88 | Mistral Small 3.2 24B | 70.5% | $0.0003 | 11.0s | 20%
89 | Mistral Small 4 (Reasoning) | 71.1% | $0.0011 | 15.3s | 21%
90 | DeepSeek V3 (2025-03-24) | 72.8% | $0.0005 | 14.6s | 16%
91 | DeepSeek V3 (2024-12-26) | 75.8% | $0.0007 | 19.4s | 15%
92 | Mistral NeMO | 66.6% | $0.0001 | 4.3s | 10%
93 | Qwen 2.5 72B | 67.9% | $0.0004 | 19.8s | 20%
94 | MiniMax M2.7 | 69.6% | $0.0021 | 28.4s | 26%
95 | Qwen 3.5 9B | 86.4% | $0.0011 | 1.6m | 51%
96 | LFM2 24B | 64.3% | $0.0001 | 12.7s | 16%
97 | Z.AI GLM 4.7 | 90.9% | $0.0054 | 1.9m | 64%
98 | Gemini 3.1 Pro (Preview) | 94.8% | $0.037 | 40.3s | 71%
99 | Mistral Large 2 | 70.4% | $0.0042 | 18.6s | 16%
100 | Claude Opus 4.6 | 92.3% | $0.036 | 34.3s | 67%
101 | Ministral 8B | 52.8% | $0.0001 | 6.6s | 14%
102 | WizardLM 2 8x22b | 61.1% | $0.0008 | 15.6s | 12%
103 | Mistral Small 4 | 48.9% | $0.0004 | 6.4s | 17%
104 | Gemini 3 Pro (Preview) | 84.3% | $0.028 | 21.6s | 41%
105 | Rocinante 12B | 51.9% | $0.0003 | 22.8s | 21%
106 | ByteDance Seed 2.0 Mini | 90.2% | $0.0023 | 2.5m | 66%
107 | Claude Opus 4.6 (Reasoning) | 92.2% | $0.040 | 38.1s | 65%
108 | Mistral Medium 3.1 | 49.0% | $0.0020 | 16.7s | 17%
109 | Mistral Small Creative | 33.7% | $0.0004 | 6.3s | 23%
110 | Ministral 3 8B | 47.9% | $0.0002 | 6.1s | 3%
111 | Qwen 3.5 397B A17B | 90.0% | $0.015 | 2.0m | 60%
112 | GPT-5 | 98.0% | $0.044 | 1.3m | 81%
113 | Claude Opus 4 | 86.0% | $0.048 | 24.7s | 61%
114 | Ministral 3 3B | 36.2% | $0.0001 | 2.6s | 0%
115 | Qwen3 235B A22B Instruct 2507 | 46.7% | $0.0004 | 22.5s | 0%
116 | Writer: Palmyra X5 | 43.2% | $0.0051 | 13.6s | 0%
117 | Llama 3.1 Nemotron 70B | 33.6% | $0.0003 | 22.3s | 0%
118 | Ministral 3 14B | 10.0% | $0.0003 | 9.0s | 0%
Average score: 85.28%

Individual Scenarios

Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.1%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 95 | 99.1%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 98 | 97 | 99.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 94 | 98.9%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5 Nano | 100 | 100 | 100 | 100 | 93 | 98.6%
Gemma 3 27B | 100 | 100 | 100 | 100 | 93 | 98.6%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 92 | 98.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 92 | 98.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 92 | 98.3%
Gemma 3 12B | 100 | 100 | 100 | 100 | 92 | 98.3%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 94 | 94 | 97.7%
Claude 3.7 Sonnet | 100 | 100 | 100 | 94 | 94 | 97.7%
Mistral Large | 100 | 100 | 100 | 100 | 86 | 97.1%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 86 | 97.1%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 85 | 96.9%
Claude Sonnet 4 | 100 | 100 | 100 | 93 | 91 | 96.8%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 94 | 90 | 96.8%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 82 | 96.4%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 91 | 90 | 96.2%
Gemma 3 4B | 100 | 100 | 100 | 100 | 80 | 96.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 79 | 95.8%
Grok 4 Fast | 100 | 100 | 100 | 91 | 88 | 95.7%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 78 | 95.6%
Qwen 3.5 Plus (2026-02-15) | 100 | 94 | 94 | 93 | 93 | 94.8%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 92 | 81 | 94.6%
Claude Opus 4 | 100 | 94 | 94 | 92 | 89 | 93.9%
Gemini 2.5 Flash Lite | 100 | 100 | 93 | 92 | 83 | 93.8%
Claude Haiku 4.5 | 100 | 100 | 100 | 91 | 77 | 93.6%
Llama 3.1 70B | 100 | 100 | 92 | 91 | 85 | 93.6%
Stealth: Healer Alpha | 100 | 100 | 96 | 91 | 77 | 92.9%
GPT-4.1 Nano | 100 | 100 | 100 | 90 | 73 | 92.5%
Z.AI GLM 5 | 100 | 94 | 90 | 88 | 86 | 91.7%
Qwen 3.5 35B | 100 | 100 | 100 | 96 | 59 | 90.9%
Qwen 3.5 9B | 100 | 96 | 95 | 93 | 56 | 87.9%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 86 | 40 | 85.1%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 91 | 78 | 55 | 84.6%
Mistral Large 3 | 100 | 92 | 80 | 76 | 67 | 83.1%
Mistral Large 2 | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 94 | 0 | 78.8%
Llama 3.1 8B | 100 | 100 | 100 | 90 | 0 | 78.0%
Qwen 3 32B | 100 | 100 | 100 | 89 | 0 | 77.8%
Writer: Palmyra X5 | 100 | 100 | 100 | 86 | 0 | 77.1%
Rocinante 12B | 100 | 94 | 67 | 63 | 50 | 74.6%
Mistral Medium 3.1 | 100 | 75 | 67 | 50 | 50 | 68.3%
Ministral 8B | 100 | 100 | 73 | 67 | 0 | 67.9%
Cohere Command R+ (Aug. 2024) | 92 | 70 | 65 | 63 | 50 | 67.8%
Ministral 3B | 100 | 100 | 71 | 50 | 0 | 64.3%
MiniMax M2.7 | 100 | 95 | 61 | 50 | 0 | 61.2%
Claude 3.5 Haiku | 64 | 64 | 64 | 60 | 55 | 61.1%
Mistral NeMO | 100 | 100 | 100 | 0 | 0 | 60.0%
Claude 3 Haiku | 100 | 50 | 50 | 50 | 43 | 58.6%
Qwen 2.5 72B | 100 | 95 | 85 | 0 | 0 | 56.0%
WizardLM 2 8x22b | 100 | 100 | 71 | 0 | 0 | 54.3%
Ministral 3 8B | 100 | 100 | 50 | 0 | 0 | 50.0%
Mistral Small 4 | 85 | 71 | 60 | 0 | 0 | 43.1%
LFM2 24B | 100 | 100 | 0 | 0 | 0 | 40.0%
Mistral Small 3.2 24B | 100 | 50 | 0 | 0 | 0 | 30.0%
Mistral Small Creative | 50 | 50 | 50 | 0 | 0 | 30.0%
Qwen3 235B A22B Instruct 2507 | 100 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 Nemotron 70B | 44 | 0 | 0 | 0 | 0 | 8.9%
Ministral 3 14B | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 3B | 0 | 0 | 0 | 0 | 0 | 0.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 122B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 | 100 | 100 | 100 | 100 | 99 | 99.8%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 97 | 99.5%
GPT-5.4 Nano | 100 | 100 | 100 | 98 | 98 | 99.3%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 95 | 98.9%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 95 | 98.9%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 95 | 98.9%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 94 | 98.8%
o4 Mini High | 100 | 100 | 100 | 100 | 94 | 98.8%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 94 | 98.8%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 94 | 98.8%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 93 | 98.6%
Mistral Large 3 | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 98 | 95 | 98.5%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 92 | 98.3%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 92 | 98.3%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 92 | 98.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 91 | 98.2%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 91 | 98.2%
Hermes 3 405B | 100 | 100 | 100 | 100 | 91 | 98.2%
GPT-5.2 | 100 | 99 | 98 | 97 | 96 | 98.1%
GPT-5.1 | 100 | 100 | 100 | 100 | 90 | 97.9%
Claude 3.5 Sonnet | 100 | 100 | 100 | 95 | 94 | 97.8%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 89 | 97.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 89 | 97.8%
Inception Mercury | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude Opus 4 | 100 | 100 | 100 | 95 | 93 | 97.7%
Claude Sonnet 4 | 100 | 100 | 100 | 93 | 93 | 97.1%
Qwen 3.5 9B | 100 | 100 | 100 | 94 | 91 | 97.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 93 | 91 | 96.8%
Claude Sonnet 4.5 | 100 | 100 | 95 | 94 | 93 | 96.5%
Gemma 3 4B | 100 | 100 | 100 | 92 | 86 | 95.6%
Z.AI GLM 5 | 100 | 100 | 95 | 95 | 88 | 95.6%
Gemma 3 27B | 100 | 100 | 100 | 92 | 83 | 95.1%
Gemma 3 12B | 100 | 100 | 91 | 91 | 90 | 94.4%
Grok 4 Fast | 100 | 100 | 91 | 89 | 86 | 93.1%
Llama 3.1 70B | 100 | 100 | 100 | 83 | 80 | 92.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 63 | 92.6%
Claude Haiku 4.5 | 100 | 100 | 94 | 88 | 79 | 92.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 80 | 80 | 92.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 60 | 92.0%
Stealth: Healer Alpha | 100 | 100 | 100 | 85 | 73 | 91.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 88 | 62 | 89.8%
Mistral Small 4 | 100 | 100 | 80 | 73 | 63 | 83.2%
Stealth: Hunter Alpha | 100 | 100 | 87 | 67 | 53 | 81.3%
LFM2 24B | 100 | 100 | 90 | 57 | 57 | 80.9%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 0 | 80.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 90 | 0 | 78.0%
Z.AI GLM 4.6 | 100 | 100 | 95 | 94 | 0 | 77.8%
Llama 3.1 8B | 100 | 100 | 91 | 90 | 0 | 76.2%
MiniMax M2.7 | 100 | 100 | 92 | 86 | 0 | 75.5%
Claude 3.5 Haiku | 100 | 75 | 70 | 64 | 60 | 73.7%
Hermes 3 70B | 100 | 100 | 100 | 60 | 0 | 72.0%
Mistral Small 4 (Reasoning) | 100 | 100 | 98 | 53 | 0 | 70.2%
Rocinante 12B | 100 | 100 | 80 | 67 | 0 | 69.3%
Claude 3 Haiku | 100 | 100 | 50 | 50 | 43 | 68.6%
Cohere Command R+ (Aug. 2024) | 86 | 75 | 70 | 62 | 50 | 68.5%
Mistral Large 2 | 100 | 100 | 100 | 0 | 0 | 60.0%
Mistral Medium 3.1 | 100 | 100 | 50 | 50 | 0 | 60.0%
Ministral 3 3B | 100 | 100 | 90 | 0 | 0 | 58.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 86 | 0 | 0 | 57.1%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 67 | 0 | 0 | 53.3%
Ministral 8B | 100 | 92 | 69 | 0 | 0 | 52.2%
WizardLM 2 8x22b | 100 | 88 | 63 | 0 | 0 | 50.0%
Ministral 3B | 67 | 64 | 58 | 58 | 0 | 49.3%
Mistral NeMO | 100 | 100 | 33 | 0 | 0 | 46.7%
Qwen 2.5 72B | 100 | 100 | 0 | 0 | 0 | 40.0%
Mistral Small Creative | 50 | 50 | 50 | 0 | 0 | 30.0%
Writer: Palmyra X5 | 100 | 0 | 0 | 0 | 0 | 20.0%
Ministral 3 8B | 68 | 0 | 0 | 0 | 0 | 13.7%
Ministral 3 14B | 0 | 0 | 0 | 0 | 0 | 0.0%
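Model | #1 | #2 | #3 | #4 | #5 | Avg
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 95 | 99.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 92 | 98.5%
Hermes 3 405B | 100 | 100 | 100 | 100 | 88 | 97.5%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 86 | 97.1%
Nemotron 3 Super | 100 | 100 | 100 | 90 | 88 | 95.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 67 | 93.3%
Z.AI GLM 4.6 | 100 | 100 | 100 | 95 | 67 | 92.3%
GPT-5.4 | 100 | 100 | 100 | 100 | 61 | 92.2%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 60 | 92.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 56 | 91.2%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 56 | 91.2%
GPT-5 Mini | 100 | 100 | 100 | 100 | 55 | 91.0%
Inception Mercury | 100 | 100 | 100 | 100 | 55 | 90.9%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 54 | 90.8%
GPT-5 | 100 | 100 | 100 | 100 | 53 | 90.5%
GPT-5.1 | 100 | 100 | 100 | 100 | 51 | 90.3%
Gemini 2.5 Flash | 100 | 100 | 100 | 92 | 58 | 90.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 50 | 90.0%
MiniMax M2.5 | 100 | 100 | 100 | 93 | 53 | 89.2%
GPT-4.1 | 100 | 100 | 100 | 100 | 45 | 89.1%
Claude 3.7 Sonnet | 100 | 100 | 100 | 95 | 50 | 89.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 94 | 50 | 88.8%
Hermes 3 70B | 100 | 100 | 100 | 71 | 60 | 86.3%
Claude 3.5 Haiku | 100 | 89 | 88 | 80 | 75 | 86.3%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 59 | 57 | 83.2%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 60 | 55 | 82.9%
Claude Haiku 4.5 | 100 | 100 | 100 | 60 | 53 | 82.5%
Grok 4 | 100 | 100 | 100 | 56 | 50 | 81.1%
Stealth: Hunter Alpha | 100 | 100 | 93 | 67 | 45 | 81.1%
GPT-5.4 Nano | 100 | 100 | 98 | 55 | 50 | 80.7%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 50 | 47 | 79.5%
Mistral Small 4 (Reasoning) | 100 | 100 | 91 | 50 | 50 | 78.2%
Claude Sonnet 4 | 93 | 92 | 92 | 57 | 53 | 77.7%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 88 | 0 | 77.6%
GPT-5.4 (Reasoning) | 100 | 100 | 65 | 58 | 57 | 76.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 63 | 58 | 57 | 75.6%
Stealth: Healer Alpha | 100 | 90 | 72 | 57 | 57 | 75.2%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 67 | 57 | 50 | 74.8%
Aion 2.0 | 100 | 100 | 63 | 60 | 50 | 74.5%
Gemini 3.1 Pro (Preview) | 100 | 100 | 67 | 53 | 50 | 74.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 57 | 50 | 50 | 71.4%
GPT-5.4 Nano (Reasoning, Low) | 100 | 99 | 55 | 53 | 49 | 71.1%
MoonshotAI: Kimi K2.5 | 100 | 100 | 54 | 54 | 47 | 71.0%
Gemini 2.5 Flash Lite | 93 | 78 | 64 | 60 | 60 | 70.9%
ByteDance Seed 2.0 Mini | 100 | 67 | 67 | 60 | 57 | 70.1%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 54 | 50 | 46 | 70.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 57 | 50 | 43 | 70.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 88 | 55 | 0 | 68.4%
Gemma 3 27B | 100 | 87 | 50 | 50 | 50 | 67.3%
Llama 3.1 70B | 100 | 60 | 57 | 56 | 54 | 65.3%
Gemma 3 12B | 100 | 81 | 55 | 53 | 38 | 65.2%
Qwen 3.5 35B | 100 | 62 | 58 | 53 | 52 | 65.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 67 | 56 | 50 | 50 | 64.4%
Claude Opus 4.6 | 100 | 59 | 55 | 54 | 52 | 64.0%
Claude Opus 4.6 (Reasoning) | 100 | 55 | 52 | 52 | 52 | 62.0%
Arcee AI: Trinity Mini | 100 | 100 | 55 | 50 | 0 | 60.9%
Qwen 3 32B | 78 | 63 | 55 | 54 | 54 | 60.5%
Llama 3.1 8B | 71 | 70 | 56 | 54 | 50 | 60.2%
Qwen 2.5 72B | 100 | 53 | 48 | 46 | 46 | 58.6%
Z.AI GLM 5 | 100 | 50 | 50 | 47 | 44 | 58.3%
WizardLM 2 8x22b | 100 | 64 | 63 | 57 | 0 | 56.8%
Qwen 3.5 9B | 100 | 76 | 58 | 50 | 0 | 56.8%
ByteDance Seed 1.6 | 63 | 60 | 55 | 55 | 50 | 56.3%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 80 | 0 | 0 | 56.0%
Qwen 3.5 122B | 57 | 56 | 56 | 55 | 55 | 55.8%
Qwen 3.5 27B | 70 | 56 | 53 | 50 | 47 | 55.3%
MiniMax M2.7 | 68 | 55 | 55 | 50 | 47 | 54.8%
Z.AI GLM 4.7 | 58 | 55 | 54 | 54 | 53 | 54.6%
Claude Opus 4 | 78 | 55 | 50 | 47 | 42 | 54.3%
Mistral Large 3 | 60 | 57 | 50 | 50 | 50 | 53.4%
DeepSeek V3.2 | 100 | 100 | 67 | 0 | 0 | 53.3%
Ministral 3B | 60 | 53 | 50 | 50 | 50 | 52.7%
Claude Sonnet 4.5 | 62 | 53 | 50 | 50 | 47 | 52.4%
Grok 4.1 Fast | 56 | 56 | 50 | 50 | 45 | 51.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 56 | 0 | 0 | 51.1%
Grok 4 Fast | 57 | 56 | 50 | 45 | 44 | 50.5%
Qwen 3.5 397B A17B | 53 | 50 | 50 | 50 | 47 | 50.1%
Mistral Large | 50 | 50 | 50 | 50 | 50 | 50.0%
Mistral Small Creative | 50 | 50 | 50 | 50 | 50 | 50.0%
Qwen 3.5 Flash | 53 | 52 | 50 | 47 | 44 | 49.4%
Gemini 3 Pro (Preview) | 60 | 56 | 54 | 54 | 0 | 44.8%
ByteDance Seed 1.6 Flash | 58 | 56 | 54 | 53 | 0 | 44.2%
Mistral Small 3.2 24B | 59 | 55 | 50 | 50 | 0 | 42.7%
Mistral Large 2 | 100 | 56 | 55 | 0 | 0 | 42.2%
Gemma 3 4B | 58 | 50 | 50 | 45 | 0 | 40.8%
DeepSeek V3 (2024-12-26) | 100 | 100 | 0 | 0 | 0 | 40.0%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 0 | 0 | 0 | 40.0%
Ministral 3 8B | 100 | 100 | 0 | 0 | 0 | 40.0%
Mistral Small 4 | 60 | 57 | 56 | 0 | 0 | 34.7%
Ministral 8B | 64 | 56 | 50 | 0 | 0 | 34.0%
LFM2 24B | 70 | 57 | 0 | 0 | 0 | 25.4%
Ministral 3 3B | 67 | 56 | 0 | 0 | 0 | 24.4%
Mistral Medium 3.1 | 67 | 50 | 0 | 0 | 0 | 23.3%
Rocinante 12B | 67 | 50 | 0 | 0 | 0 | 23.3%
Llama 3.1 Nemotron 70B | 60 | 50 | 0 | 0 | 0 | 22.0%
Writer: Palmyra X5 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 14B | 0 | 0 | 0 | 0 | 0 | 0.0%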
Model | #1 | #2 | #3 | #4 | #5 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 100 | 98 | 99.6%
GPT-5 | 100 | 100 | 100 | 100 | 98 | 99.5%
GPT-5.2 | 100 | 100 | 99 | 99 | 98 | 99.1%
GPT-5.1 | 100 | 100 | 100 | 99 | 97 | 99.1%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 95 | 99.1%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 95 | 99.0%
Stealth: Hunter Alpha | 100 | 100 | 100 | 100 | 95 | 99.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 95 | 98.9%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 95 | 98.9%
GPT-5.4 | 100 | 100 | 100 | 99 | 95 | 98.8%
Qwen 3.5 9B | 100 | 100 | 100 | 100 | 94 | 98.8%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 93 | 98.6%
Qwen 3.5 122B | 100 | 100 | 100 | 96 | 96 | 98.5%
MiniMax M2.5 | 100 | 100 | 100 | 97 | 96 | 98.5%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 92 | 98.3%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 92 | 98.3%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 92 | 98.3%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 91 | 98.2%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 90 | 98.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 89 | 97.8%
Gemma 3 12B | 100 | 100 | 100 | 100 | 89 | 97.8%
Claude 3.7 Sonnet | 100 | 100 | 100 | 95 | 94 | 97.8%
MiniMax M2.7 | 100 | 100 | 98 | 96 | 94 | 97.5%
Claude Sonnet 4 | 100 | 100 | 100 | 94 | 92 | 97.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 93 | 92 | 96.9%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 80 | 96.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 80 | 96.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 92 | 87 | 95.8%
Mistral Large 3 | 100 | 100 | 100 | 93 | 83 | 95.2%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 95 | 91 | 89 | 94.9%
Gemini 2.5 Flash Lite | 100 | 100 | 92 | 90 | 86 | 93.5%
Claude Haiku 4.5 | 100 | 100 | 94 | 93 | 78 | 93.0%
Qwen 3.5 35B | 100 | 100 | 100 | 95 | 67 | 92.3%
Claude Opus 4 | 94 | 93 | 92 | 92 | 85 | 91.4%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 50 | 90.0%
Gemma 3 27B | 93 | 93 | 92 | 86 | 85 | 89.9%
Hermes 3 70B | 100 | 100 | 100 | 91 | 58 | 89.8%
Qwen 2.5 72B | 100 | 100 | 100 | 94 | 53 | 89.4%
Mistral Large | 100 | 100 | 100 | 100 | 46 | 89.2%
Llama 3.1 70B | 100 | 100 | 93 | 83 | 67 | 88.6%
Nemotron 3 Nano | 100 | 100 | 100 | 83 | 50 | 86.7%
LFM2 24B | 100 | 100 | 100 | 73 | 50 | 84.5%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 78 | 70 | 60 | 81.6%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 0 | 80.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 0 | 80.0%
Qwen 3 32B | 100 | 100 | 100 | 100 | 0 | 80.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 0 | 80.0%
Claude 3.5 Haiku | 100 | 86 | 80 | 70 | 64 | 79.9%
GPT-5.4 Nano | 100 | 100 | 100 | 97 | 0 | 79.5%
Claude 3 Haiku | 100 | 100 | 100 | 50 | 40 | 78.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 95 | 50 | 43 | 77.6%
Ministral 3 8B | 100 | 100 | 100 | 86 | 0 | 77.1%
Stealth: Healer Alpha | 100 | 100 | 100 | 81 | 0 | 76.2%
Mistral Small 4 (Reasoning) | 100 | 100 | 100 | 73 | 0 | 74.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 64 | 58 | 50 | 74.4%
Llama 3.1 8B | 100 | 97 | 88 | 83 | 0 | 73.5%
Ministral 3B | 100 | 86 | 64 | 57 | 56 | 72.5%
Gemma 3 4B | 100 | 100 | 85 | 73 | 0 | 71.5%
WizardLM 2 8x22b | 100 | 100 | 100 | 56 | 0 | 71.1%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 50 | 0 | 70.0%
Ministral 8B | 100 | 91 | 64 | 54 | 0 | 61.8%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 0 | 0 | 60.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 0 | 0 | 60.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 0 | 0 | 60.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 89 | 0 | 0 | 57.8%
Mistral Small 4 | 85 | 81 | 71 | 0 | 0 | 47.3%
Ministral 3 3B | 100 | 93 | 0 | 0 | 0 | 38.6%
Mistral Small Creative | 92 | 50 | 50 | 0 | 0 | 38.5%
Rocinante 12B | 71 | 70 | 43 | 0 | 0 | 36.9%
Mistral Medium 3.1 | 50 | 50 | 50 | 0 | 0 | 30.0%
Ministral 3 14B | 100 | 50 | 0 | 0 | 0 | 30.0%
Model | #1 | #2 | #3 | #4 | #5 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 Turbo | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 27B | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Flash Lite (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Mini (Reasoning, Low) | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 2.0 Lite | 100 | 100 | 100 | 100 | 100 | 100.0%
Nemotron 3 Super | 100 | 100 | 100 | 100 | 100 | 100.0%
Inception Mercury 2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Stealth: Aurora Alpha | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.4 Nano (Reasoning, Low) | 100 | 100 | 100 | 99 | 98 | 99.4%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 100 | 97 | 99.3%
GPT-5.4 Nano (Reasoning) | 100 | 100 | 100 | 98 | 98 | 99.2%
MiniMax M2.5 | 100 | 100 | 100 | 100 | 96 | 99.2%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.2%
GPT-5.4 | 100 | 100 | 100 | 99 | 97 | 99.2%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 96 | 99.1%
GPT-5.1 | 100 | 100 | 100 | 98 | 97 | 99.1%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 94 | 98.9%
o4 Mini High | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.4 Nano | 100 | 100 | 99 | 98 | 97 | 98.8%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 94 | 98.8%
GPT-5.2 | 100 | 99 | 99 | 99 | 97 | 98.7%
Stealth: Healer Alpha | 100 | 100 | 100 | 100 | 93 | 98.6%
Grok 4.20 (Beta) | 100 | 100 | 100 | 100 | 93 | 98.6%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 93 | 98.6%
GPT-5.4 Mini (Reasoning) | 100 | 100 | 100 | 100 | 92 | 98.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 91 | 98.2%
Qwen 3 32B | 100 | 100 | 100 | 100 | 89 | 97.8%
Qwen 3.5 Flash | 100 | 100 | 100 | 100 | 88 | 97.5%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 88 | 97.5%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 94 | 93 | 97.5%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 95 | 92 | 97.4%
Claude Opus 4.6 | 100 | 100 | 96 | 95 | 95 | 97.3%
Stealth: Hunter Alpha | 100 | 100 | 96 | 95 | 94 | 97.1%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 82 | 96.4%
Gemma 3 27B | 100 | 100 | 94 | 94 | 93 | 96.2%
Claude 3.7 Sonnet | 100 | 100 | 100 | 94 | 86 | 96.0%
Z.AI GLM 4.6 | 100 | 96 | 96 | 95 | 93 | 95.9%
Qwen 3.5 122B | 100 | 100 | 94 | 94 | 90 | 95.8%
Qwen 2.5 72B | 100 | 100 | 100 | 89 | 89 | 95.6%
Gemma 3 12B | 100 | 100 | 100 | 100 | 76 | 95.3%
Claude Sonnet 4 | 100 | 100 | 93 | 92 | 86 | 94.2%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 68 | 93.7%
Hermes 3 70B | 100 | 100 | 100 | 88 | 80 | 93.5%
Claude Opus 4 | 100 | 93 | 93 | 92 | 86 | 92.7%
Nemotron 3 Nano | 100 | 100 | 100 | 88 | 75 | 92.5%
GPT-5.4 Mini | 100 | 100 | 96 | 82 | 80 | 91.7%
Qwen 3.5 35B | 100 | 100 | 100 | 100 | 57 | 91.3%
Qwen 3.5 9B | 100 | 100 | 94 | 90 | 73 | 91.3%
LFM2 24B | 100 | 100 | 89 | 89 | 75 | 90.6%
Inception Mercury | 100 | 100 | 100 | 100 | 50 | 90.0%
Mistral Large 3 | 100 | 100 | 93 | 90 | 67 | 89.9%
Aion 2.0 | 100 | 100 | 95 | 94 | 56 | 89.0%
GPT-4o, May 13th (temp=0) | 93 | 93 | 92 | 91 | 75 | 88.8%
Llama 3.1 70B | 100 | 100 | 86 | 75 | 73 | 86.7%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 75 | 56 | 86.3%
Grok 4 Fast | 100 | 100 | 89 | 70 | 70 | 85.8%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 75 | 50 | 85.0%
ByteDance Seed 1.6 Flash | 94 | 93 | 89 | 89 | 50 | 83.0%
Claude Haiku 4.5 | 100 | 100 | 82 | 65 | 63 | 82.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 0 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 90 | 55 | 53 | 79.6%
Llama 3.1 8B | 100 | 100 | 100 | 89 | 0 | 77.8%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 62 | 60 | 59 | 76.1%
WizardLM 2 8x22b | 100 | 100 | 100 | 67 | 0 | 73.3%
Gemini 2.5 Flash Lite | 100 | 91 | 87 | 80 | 0 | 71.5%
Claude 3.5 Haiku | 90 | 73 | 67 | 64 | 58 | 70.3%
Mistral Large 2 | 100 | 100 | 100 | 50 | 0 | 70.0%
Gemma 3 4B | 100 | 100 | 83 | 62 | 0 | 69.0%
Mistral NeMO | 100 | 100 | 92 | 40 | 0 | 66.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 67 | 64 | 0 | 66.1%
Mistral Medium 3.1 | 100 | 67 | 50 | 50 | 50 | 63.3%
Qwen3 235B A22B Instruct 2507 | 100 | 100 | 100 | 0 | 0 | 60.0%
Ministral 3 3B | 100 | 100 | 100 | 0 | 0 | 60.0%
MiniMax M2.7 | 100 | 100 | 95 | 0 | 0 | 58.9%
Ministral 3 8B | 100 | 100 | 94 | 0 | 0 | 58.8%
Ministral 3B | 86 | 57 | 55 | 50 | 46 | 58.7%
Writer: Palmyra X5 | 100 | 100 | 93 | 0 | 0 | 58.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 88 | 0 | 0 | 57.5%
Rocinante 12B | 64 | 64 | 63 | 54 | 33 | 55.4%
Ministral 8B | 100 | 82 | 59 | 0 | 0 | 48.2%
Mistral Small 4 (Reasoning) | 100 | 89 | 0 | 0 | 0 | 37.8%
Mistral Small 4 | 64 | 60 | 58 | 0 | 0 | 36.4%
Mistral Small Creative | 50 | 50 | 0 | 0 | 0 | 20.0%
Ministral 3 14B | 100 | 0 | 0 | 0 | 0 | 20.0%
Llama 3.1 Nemotron 70B | 50 | 0 | 0 | 0 | 0 | 10.0%