Dialogue content preserved

Test: Text Replacement

Avg. Score: 93.8%
Scenarios: 8

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Qwen 2.5 72B | 99.6% | $0.0003 | 12.0s | 96%
2 | Gemini 3 Flash (Preview) | 98.6% | $0.0021 | 3.5s | 93%
3 | Gemini 2.5 Flash Lite | 97.0% | $0.0003 | 1.9s | 91%
4 | Qwen 3.5 Plus (2026-02-15) | 97.9% | $0.0017 | 7.8s | 92%
5 | Writer: Palmyra X5 | 98.6% | $0.0040 | 11.7s | 93%
6 | Mistral Large 3 | 97.5% | $0.0012 | 8.5s | 91%
7 | Claude Haiku 4.5 | 97.5% | $0.0040 | 3.8s | 91%
8 | Gemma 3 4B | 96.8% | $0.0001 | 6.9s | 88%
9 | Mistral Large | 97.5% | $0.0049 | 8.3s | 91%
10 | Mistral Large 2 | 97.5% | $0.0049 | 8.4s | 91%
11 | Claude Opus 4.5 | 100.0% | $0.020 | 6.0s | 100%
12 | Claude Opus 4.6 | 100.0% | $0.020 | 6.6s | 100%
13 | Arcee AI: Trinity Large (Preview) | 97.9% | $0.0000 | 26.5s | 92%
14 | Claude Sonnet 4.6 | 98.8% | $0.012 | 5.4s | 93%
15 | Claude Sonnet 4.5 | 98.7% | $0.012 | 5.4s | 93%
16 | Grok 4 Fast | 96.4% | $0.0010 | 8.5s | 86%
17 | Claude Sonnet 4 | 97.9% | $0.012 | 6.9s | 92%
18 | Claude 3.7 Sonnet | 97.5% | $0.012 | 6.5s | 91%
19 | GPT-4o, May 13th (temp=1) | 97.1% | $0.012 | 4.0s | 91%
20 | GPT-4o, May 13th (temp=0) | 97.0% | $0.012 | 3.8s | 91%
21 | Gemini 2.5 Flash | 95.2% | $0.0017 | 2.4s | 83%
22 | GPT-4.1 Mini | 94.8% | $0.0012 | 8.2s | 86%
23 | GPT-4.1 | 96.6% | $0.0060 | 4.8s | 82%
24 | Mistral Small 3.2 24B | 97.3% | $0.0002 | 5.5s | 73%
25 | Gemini 3 Flash (Preview, Reasoning) | 98.9% | $0.015 | 24.9s | 93%
26 | Grok 4.1 Fast | 95.7% | $0.0012 | 15.5s | 80%
27 | Mistral Medium 3.1 | 93.8% | $0.0014 | 6.3s | 80%
28 | Hermes 3 405B | 95.9% | $0.0013 | 25.3s | 82%
29 | Gemma 3 12B | 93.9% | $0.0001 | 10.2s | 78%
30 | Ministral 3 14B | 93.2% | $0.0003 | 4.5s | 77%
31 | Claude 3.5 Sonnet | 97.5% | $0.024 | 10.6s | 91%
32 | GPT-4o, Aug. 6th (temp=0) | 96.3% | $0.0074 | 3.0s | 73%
33 | Llama 3.1 8B | 93.0% | $0.0001 | 11.5s | 76%
34 | Ministral 3 3B | 92.0% | $0.0001 | 2.3s | 74%
35 | Mistral Small Creative | 91.6% | $0.0002 | 3.3s | 74%
36 | DeepSeek V3.2 | 96.4% | $0.0005 | 52.3s | 82%
37 | GPT-5 Mini | 96.8% | $0.0074 | 46.2s | 86%
38 | Gemma 3 27B | 93.0% | $0.0002 | 18.5s | 76%
39 | Z.AI GLM 4.6 | 96.1% | $0.0059 | 50.0s | 86%
40 | GPT-5.1 | 97.7% | $0.021 | 22.6s | 86%
41 | Ministral 3B | 90.9% | $0.0001 | 2.4s | 72%
42 | Llama 3.1 70B | 93.0% | $0.0006 | 27.8s | 76%
43 | Minimax M2.5 | 96.1% | $0.0019 | 1.2m | 88%
44 | DeepSeek V3 (2024-12-26) | 94.6% | $0.0009 | 17.7s | 67%
45 | Llama 3.1 Nemotron 70B | 91.4% | $0.0016 | 16.4s | 75%
46 | Z.AI GLM 4.5 | 93.0% | $0.0044 | 36.1s | 80%
47 | GPT-5.2 | 93.4% | $0.013 | 10.9s | 78%
48 | Gemini 2.5 Flash (Reasoning) | 92.9% | $0.0087 | 15.8s | 76%
49 | ByteDance Seed 1.6 | 95.5% | $0.0057 | 1.0m | 84%
50 | Grok 4 | 99.1% | $0.033 | 41.7s | 94%
51 | GPT-4o Mini (temp=1) | 89.3% | $0.0005 | 10.5s | 70%
52 | Ministral 3 8B | 88.4% | $0.0002 | 3.6s | 69%
53 | Ministral 8B | 87.9% | $0.0001 | 3.7s | 71%
54 | GPT-4o, Aug. 6th (temp=1) | 93.7% | $0.0073 | 3.1s | 63%
55 | Gemini 2.5 Pro | 98.8% | $0.039 | 28.9s | 93%
56 | DeepSeek-V2 Chat | 93.4% | $0.0009 | 19.1s | 61%
57 | GPT-4o Mini (temp=0) | 88.7% | $0.0004 | 10.7s | 69%
58 | Arcee AI: Trinity Mini | 92.7% | $0.0002 | 7.9s | 55%
59 | GPT-4.1 Nano | 88.0% | $0.0003 | 4.1s | 66%
60 | Z.AI GLM 5 | 99.3% | $0.014 | 1.9m | 95%
61 | Gemini 3 Pro (Preview) | 100.0% | $0.053 | 35.9s | 100%
62 | ByteDance Seed 1.6 Flash | 88.9% | $0.0009 | 17.2s | 61%
63 | Aion 2.0 | 95.2% | $0.0051 | 1.1m | 69%
64 | Mistral NeMO | 90.2% | $0.0002 | 2.8s | 50%
65 | WizardLM 2 8x22b | 92.1% | $0.0009 | 51.4s | 65%
66 | Claude Opus 4.6 (Reasoning) | 100.0% | $0.063 | 29.0s | 100%
67 | DeepSeek V3 (2025-03-24) | 92.1% | $0.0007 | 40.7s | 56%
68 | Gemini 2.5 Flash Lite (Reasoning) | 88.4% | $0.0030 | 26.8s | 62%
69 | Z.AI GLM 4.7 | 98.8% | $0.011 | 2.4m | 91%
70 | GPT-5 Nano | 90.9% | $0.0035 | 1.3m | 75%
71 | Claude Opus 4 | 97.3% | $0.060 | 9.4s | 91%
72 | GPT-5 | 98.0% | $0.039 | 59.9s | 86%
73 | o4 Mini High | 95.9% | $0.034 | 52.1s | 79%
74 | Claude Sonnet 4.6 (Reasoning) | 99.6% | $0.064 | 41.7s | 96%
75 | o4 Mini | 91.6% | $0.019 | 29.9s | 62%
76 | Qwen 3.5 397B A17B | 98.8% | $0.011 | 3.0m | 91%
77 | DeepSeek V3.1 | 85.0% | $0.0007 | 37.9s | 41%
78 | Z.AI GLM 4.7 Flash | 87.3% | $0.0023 | 1.6m | 55%
79 | Gemini 3.1 Pro (Preview) | 100.0% | $0.086 | 1.3m | 100%
80 | MoonshotAI: Kimi K2.5 | 96.2% | $0.018 | 3.8m | 87%
81 | Cohere Command R+ (Aug. 2024) | 72.0% | $0.0076 | 33.3s | 31%
82 | Rocinante 12B | 65.0% | $0.0004 | 9.2s | 16%
83 | Claude 3 Haiku | 64.8% | $0.0010 | 5.1s | 7%
84 | Hermes 3 70B | 63.9% | $0.0016 | 2.1m | 10%
Avg. Score: 93.78%

Individual Scenarios

Generic Prompt

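Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Ministral 3 3B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 95.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 70 | 95.7%
Gemma 3 4B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
Ministral 3B | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
Minimax M2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Z.AI GLM 4.6 | 100 | 100 | 90 | 90 | 90 | 90 | 80 | 91.4%
GPT-4o Mini (temp=1) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
GPT-4o Mini (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 90 | 80 | 40 | 87.1%
GPT-5.2 | 100 | 100 | 100 | 90 | 80 | 70 | 70 | 87.1%
Grok 4.1 Fast | 100 | 100 | 90 | 90 | 90 | 80 | 50 | 85.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 70 | 70 | 60 | 85.7%
Mistral NeMO | 100 | 100 | 80 | 80 | 80 | 80 | 80 | 85.7%
Llama 3.1 70B | 100 | 80 | 80 | 80 | 80 | 80 | 80 | 82.9%
Llama 3.1 8B | 100 | 100 | 90 | 80 | 70 | 70 | 70 | 82.9%
Z.AI GLM 4.5 | 90 | 90 | 90 | 80 | 70 | 70 | 70 | 80.0%
GPT-5 Nano | 90 | 90 | 80 | 70 | 70 | 70 | 70 | 77.1%
Llama 3.1 Nemotron 70B | 90 | 80 | 80 | 80 | 70 | 70 | 70 | 77.1%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 71.4%
Ministral 8B | 80 | 80 | 80 | 70 | 70 | 60 | 60 | 71.4%
Ministral 3 8B | 80 | 70 | 70 | 70 | 70 | 60 | 50 | 67.1%
Cohere Command R+ (Aug. 2024) | 100 | 50 | 20 | 20 | 20 | 10 | 10 | 32.9%
Claude 3 Haiku | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6%
Rocinante 12B | 90 | 20 | 20 | 20 | 10 | 10 | 0 | 24.3%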
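
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Ministral 3 3B | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Ministral 3B | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 80 | 0 | 82.9%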
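
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
GPT-5 Nano | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
GPT-5.1 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Grok 4 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Claude Haiku 4.5 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
ByteDance Seed 1.6 | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Hermes 3 405B | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
Minimax M2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
GPT-4o, May 13th (temp=0) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
MoonshotAI: Kimi K2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Gemma 3 4B | 100 | 100 | 100 | 90 | 90 | 90 | 80 | 92.9%
GPT-4o, May 13th (temp=1) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
ByteDance Seed 1.6 Flash | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
GPT-5.2 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Z.AI GLM 4.6 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Z.AI GLM 4.5 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Writer: Palmyra X5 | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Arcee AI: Trinity Large (Preview) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
GPT-4.1 Nano | 100 | 100 | 100 | 90 | 90 | 80 | 80 | 91.4%
Claude Sonnet 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude Sonnet 4.5 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude Opus 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Qwen 3.5 Plus (2026-02-15) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Large 3 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude 3.5 Sonnet | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude 3.7 Sonnet | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4.1 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Large 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 2.5 Flash Lite | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 2.5 Flash | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Large | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4o Mini (temp=1) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4o Mini (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Hermes 3 70B | 100 | 100 | 90 | 90 | 90 | 80 | 80 | 90.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 90 | 40 | 90.0%
Gemma 3 27B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6%
Gemini 2.5 Flash (Reasoning) | 100 | 90 | 90 | 90 | 90 | 80 | 80 | 88.6%
Gemma 3 12B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6%
Llama 3.1 8B | 100 | 100 | 100 | 90 | 80 | 70 | 70 | 87.1%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 87.1%
Llama 3.1 70B | 90 | 90 | 90 | 90 | 90 | 80 | 80 | 87.1%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 90 | 90 | 90 | 0 | 81.4%
Llama 3.1 Nemotron 70B | 90 | 90 | 80 | 80 | 80 | 70 | 60 | 78.6%
GPT-4o, Aug. 6th (temp=1) | 90 | 90 | 90 | 90 | 90 | 90 | 0 | 77.1%
o4 Mini | 100 | 90 | 90 | 90 | 90 | 40 | 30 | 75.7%
WizardLM 2 8x22b | 100 | 90 | 90 | 90 | 90 | 40 | 0 | 71.4%
Mistral Medium 3.1 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Ministral 3 14B | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Ministral 3 8B | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Ministral 8B | 70 | 70 | 70 | 70 | 70 | 70 | 60 | 68.6%
Ministral 3 3B | 80 | 70 | 70 | 70 | 60 | 60 | 60 | 67.1%
Ministral 3B | 70 | 70 | 70 | 70 | 60 | 60 | 50 | 64.3%
Mistral Small Creative | 70 | 70 | 70 | 60 | 60 | 60 | 60 | 64.3%
Mistral NeMO | 80 | 80 | 70 | 20 | 0 | 0 | 0 | 35.7%
Rocinante 12B | 60 | 30 | 20 | 20 | 10 | 0 | 0 | 20.0%
Cohere Command R+ (Aug. 2024) | 30 | 20 | 10 | 10 | 10 | 10 | 10 | 14.3%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%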
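
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 97.1%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 80 | 95.7%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Ministral 3 3B | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 90 | 90 | 70 | 92.9%
GPT-4o, May 13th (temp=1) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Llama 3.1 8B | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Gemini 2.5 Pro | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
GPT-4.1 Nano | 100 | 100 | 90 | 90 | 90 | 90 | 80 | 91.4%
Qwen 3.5 397B A17B | 100 | 100 | 90 | 90 | 90 | 90 | 70 | 90.0%
Claude Sonnet 4.6 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude Opus 4 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4o, May 13th (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 3 Flash (Preview) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude Haiku 4.5 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4o, Aug. 6th (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Llama 3.1 70B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Medium 3.1 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Llama 3.1 Nemotron 70B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Ministral 3B | 100 | 100 | 90 | 90 | 90 | 80 | 80 | 90.0%
Minimax M2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 70 | 90.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 90 | 90 | 90 | 90 | 90 | 80 | 90.0%
Mistral Small Creative | 100 | 90 | 90 | 90 | 90 | 90 | 70 | 88.6%
GPT-5 | 100 | 100 | 100 | 100 | 80 | 70 | 70 | 88.6%
Z.AI GLM 4.6 | 100 | 100 | 100 | 80 | 80 | 80 | 70 | 87.1%
MoonshotAI: Kimi K2.5 | 100 | 90 | 90 | 90 | 90 | 70 | 70 | 85.7%
Hermes 3 70B | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 85.7%
GPT-5.1 | 100 | 100 | 100 | 80 | 80 | 70 | 70 | 85.7%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Grok 4.1 Fast | 100 | 100 | 90 | 80 | 70 | 70 | 70 | 82.9%
Grok 4 Fast | 100 | 90 | 90 | 80 | 80 | 70 | 70 | 82.9%
GPT-5 Mini | 90 | 90 | 90 | 80 | 80 | 70 | 70 | 81.4%
DeepSeek-V2 Chat | 90 | 90 | 90 | 90 | 70 | 70 | 70 | 81.4%
GPT-4.1 Mini | 90 | 90 | 80 | 80 | 80 | 80 | 70 | 81.4%
Aion 2.0 | 100 | 90 | 90 | 80 | 80 | 70 | 50 | 80.0%
Gemini 2.5 Flash | 90 | 90 | 80 | 80 | 80 | 80 | 60 | 80.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 90 | 80 | 50 | 40 | 80.0%
o4 Mini High | 90 | 90 | 80 | 80 | 80 | 70 | 70 | 80.0%
ByteDance Seed 1.6 | 90 | 80 | 80 | 80 | 80 | 70 | 70 | 78.6%
ByteDance Seed 1.6 Flash | 80 | 80 | 80 | 80 | 80 | 80 | 70 | 78.6%
Z.AI GLM 4.5 | 90 | 80 | 80 | 80 | 70 | 70 | 70 | 77.1%
Z.AI GLM 4.7 Flash | 90 | 90 | 80 | 80 | 80 | 60 | 60 | 77.1%
o4 Mini | 90 | 90 | 80 | 80 | 70 | 70 | 50 | 75.7%
DeepSeek V3 (2025-03-24) | 90 | 90 | 90 | 90 | 90 | 70 | 10 | 75.7%
DeepSeek V3 (2024-12-26) | 90 | 90 | 90 | 90 | 70 | 50 | 50 | 75.7%
Hermes 3 405B | 80 | 80 | 80 | 80 | 80 | 70 | 60 | 75.7%
DeepSeek V3.2 | 90 | 90 | 80 | 70 | 70 | 70 | 60 | 75.7%
GPT-4.1 | 90 | 80 | 70 | 70 | 70 | 70 | 70 | 74.3%
WizardLM 2 8x22b | 90 | 90 | 80 | 70 | 70 | 60 | 60 | 74.3%
Gemini 2.5 Flash (Reasoning) | 90 | 90 | 80 | 70 | 70 | 50 | 50 | 71.4%
GPT-4o Mini (temp=1) | 80 | 70 | 70 | 70 | 70 | 70 | 70 | 71.4%
Rocinante 12B | 100 | 100 | 100 | 100 | 80 | 20 | 0 | 71.4%
GPT-5.2 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
GPT-4o Mini (temp=0) | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Gemma 3 12B | 80 | 80 | 80 | 80 | 70 | 50 | 50 | 70.0%
GPT-5 Nano | 80 | 70 | 70 | 70 | 70 | 60 | 50 | 67.1%
Gemma 3 27B | 90 | 80 | 80 | 60 | 60 | 50 | 50 | 67.1%
Gemini 2.5 Flash Lite (Reasoning) | 90 | 80 | 70 | 60 | 50 | 50 | 50 | 64.3%
DeepSeek V3.1 | 70 | 70 | 50 | 50 | 50 | 50 | 0 | 48.6%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%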

Specific Prompt

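Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 97.1%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 90 | 70 | 94.3%
GPT-4o Mini (temp=1) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
GPT-4o Mini (temp=0) | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 90 | 80 | 80 | 70 | 88.6%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 20 | 88.6%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 90 | 90 | 80 | 70 | 70 | 85.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
ByteDance Seed 1.6 Flash | 100 | 80 | 80 | 80 | 80 | 60 | 10 | 70.0%
Z.AI GLM 4.7 Flash | 100 | 90 | 90 | 80 | 70 | 30 | 10 | 67.1%
GPT-4.1 Nano | 90 | 80 | 60 | 60 | 60 | 60 | 60 | 67.1%
Hermes 3 70B | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 12.9%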
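
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 70 | 0 | 81.4%
Hermes 3 70B | 100 | 100 | 100 | 90 | 0 | 0 | 0 | 55.7%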
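
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 90 | 90 | 80 | 94.3%
Minimax M2.5 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Gemma 3 12B | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Arcee AI: Trinity Mini | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Claude Sonnet 4 | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
GPT-4o, May 13th (temp=0) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
GPT-4o, May 13th (temp=1) | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Mistral Small 3.2 24B | 100 | 100 | 90 | 90 | 90 | 90 | 90 | 92.9%
Arcee AI: Trinity Large (Preview) | 100 | 90 | 90 | 90 | 90 | 90 | 90 | 91.4%
Mistral Large 3 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude 3.5 Sonnet | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude 3.7 Sonnet | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
GPT-4.1 Mini | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Large 2 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemini 2.5 Flash Lite | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Large | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Llama 3.1 70B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Gemma 3 27B | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Mistral Medium 3.1 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Claude 3 Haiku | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90.0%
Ministral 3 3B | 90 | 90 | 90 | 90 | 90 | 90 | 80 | 88.6%
Ministral 3B | 100 | 90 | 90 | 90 | 90 | 80 | 80 | 88.6%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 10 | 87.1%
Llama 3.1 Nemotron 70B | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 85.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 90 | 90 | 20 | 85.7%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 90 | 0 | 84.3%
Llama 3.1 8B | 100 | 100 | 90 | 90 | 80 | 80 | 40 | 82.9%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 90 | 90 | 0 | 82.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 80 | 0 | 82.9%
GPT-4.1 Nano | 90 | 90 | 80 | 80 | 80 | 70 | 70 | 80.0%
Mistral Small Creative | 80 | 80 | 80 | 80 | 80 | 80 | 80 | 80.0%
Ministral 3 14B | 80 | 80 | 80 | 80 | 80 | 70 | 70 | 77.1%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 70 | 70 | 60 | 60 | 40 | 71.4%
Ministral 3 8B | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Ministral 8B | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 90 | 90 | 0 | 0 | 68.6%
Rocinante 12B | 100 | 90 | 90 | 60 | 40 | 0 | 0 | 54.3%
Hermes 3 70B | 90 | 80 | 0 | 0 | 0 | 0 | 0 | 24.3%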
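
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 90 | 98.6%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 90 | 90 | 97.1%
GPT-5 Nano | 100 | 100 | 100 | 100 | 90 | 90 | 90 | 95.7%
Ministral 3 3B | 100 | 100 | 100 | 90 | 90 | 90 | 90 | 94.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 90 | 70 | 94.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 30 | 90.0%
ByteDance Seed 1.6 Flash | 100 | 90 | 90 | 90 | 80 | 80 | 80 | 87.1%
Gemma 3 4B | 90 | 90 | 90 | 90 | 80 | 80 | 80 | 85.7%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Z.AI GLM 4.7 Flash | 100 | 90 | 90 | 80 | 80 | 80 | 20 | 77.1%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 90 | 80 | 60 | 60 | 50 | 77.1%
GPT-4.1 Nano | 80 | 80 | 80 | 70 | 70 | 70 | 70 | 74.3%
GPT-4o Mini (temp=1) | 80 | 80 | 70 | 70 | 70 | 70 | 70 | 72.9%
GPT-4o Mini (temp=0) | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70.0%
Hermes 3 70B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 42.9%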
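
The page does not state how its figures are derived, but the rows above are consistent with two simple relationships: each scenario's Avg appears to be the arithmetic mean of runs #1 to #7, and each model's overall Score appears to be the mean of its eight scenario averages. The following is a minimal Python sketch under that assumption; the function names `scenario_average` and `overall_score` are hypothetical and are not taken from the benchmark.

```python
from statistics import mean


def scenario_average(runs):
    """Mean of the per-run scores, rounded to one decimal like the Avg column.

    Assumption: Avg is a plain arithmetic mean of the seven runs.
    """
    return round(mean(runs), 1)


def overall_score(scenario_averages):
    """Mean of a model's scenario averages, rounded to one decimal.

    Assumption: the overall Score weights all scenarios equally. The site
    may average unrounded values, so the last digit can differ slightly.
    """
    return round(mean(scenario_averages), 1)


# Hermes 3 70B, last Specific Prompt table: runs #1 to #7.
hermes_runs = [100, 100, 100, 0, 0, 0, 0]

# Qwen 2.5 72B: its eight scenario averages, in table order.
qwen_averages = [100.0, 100.0, 100.0, 97.1, 100.0, 100.0, 100.0, 100.0]

if __name__ == "__main__":
    print(scenario_average(hermes_runs))
    print(overall_score(qwen_averages))
```

Under these assumptions the two calls reproduce the 42.9% listed for Hermes 3 70B in the last table and the 99.6% overall Score listed for Qwen 2.5 72B.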