Name replacement accuracy

Test: Text Replacement

Avg. Score: 95.1%
Scenarios: 14

Overall Performance

Rank | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Grok 4 Fast | 99.6% | $0.0007 | 5.8s | 98%
2 | Claude Haiku 4.5 | 99.7% | $0.0034 | 2.8s | 98%
3 | Grok 4.1 Fast | 99.5% | $0.0008 | 10.6s | 98%
4 | GPT-4.1 Mini | 98.9% | $0.0010 | 6.6s | 95%
5 | Qwen 3.5 Plus (2026-02-15) | 99.2% | $0.0015 | 6.7s | 95%
6 | Mistral Large 3 | 98.7% | $0.0011 | 7.4s | 96%
7 | Gemini 3 Flash (Preview) | 98.4% | $0.0018 | 3.2s | 94%
8 | GPT-4.1 | 99.4% | $0.0052 | 4.1s | 97%
9 | GPT-4o Mini (temp=1) | 98.2% | $0.0004 | 9.2s | 94%
10 | Qwen 2.5 72B | 98.4% | $0.0003 | 10.3s | 94%
11 | Gemini 2.5 Flash Lite | 98.1% | $0.0003 | 1.6s | 88%
12 | GPT-4o Mini (temp=0) | 98.0% | $0.0004 | 9.2s | 93%
13 | Mistral Large | 98.8% | $0.0042 | 7.2s | 96%
14 | Mistral Large 2 | 98.8% | $0.0042 | 7.2s | 96%
15 | Gemma 3 27B | 98.6% | $0.0002 | 16.0s | 94%
16 | GPT-4o, Aug. 6th (temp=0) | 99.0% | $0.0064 | 2.6s | 94%
17 | DeepSeek-V2 Chat | 98.4% | $0.0008 | 16.5s | 94%
18 | Gemini 2.5 Flash (Reasoning) | 99.1% | $0.0054 | 10.7s | 97%
19 | Claude Sonnet 4 | 99.8% | $0.010 | 5.7s | 99%
20 | GPT-4o, May 13th (temp=0) | 99.7% | $0.010 | 3.5s | 98%
21 | Claude Sonnet 4.5 | 99.6% | $0.010 | 4.6s | 98%
22 | Gemini 2.5 Flash | 98.7% | $0.0014 | 2.1s | 82%
23 | GPT-4o, Aug. 6th (temp=1) | 98.7% | $0.0064 | 2.6s | 92%
24 | Llama 3.1 8B | 96.5% | $0.0000 | 9.3s | 89%
25 | Mistral Medium 3.1 | 96.2% | $0.0012 | 6.0s | 89%
26 | Z.AI GLM 4.5 | 99.7% | $0.0031 | 26.1s | 98%
27 | Claude 3.7 Sonnet | 99.4% | $0.010 | 5.5s | 97%
28 | GPT-4o, May 13th (temp=1) | 99.2% | $0.010 | 3.3s | 96%
29 | Claude Sonnet 4.6 | 99.2% | $0.010 | 4.4s | 96%
30 | GPT-5.2 | 98.8% | $0.0087 | 6.0s | 94%
31 | Llama 3.1 Nemotron 70B | 97.7% | $0.0014 | 13.9s | 89%
32 | Mistral Small 3.2 24B | 97.7% | $0.0002 | 4.7s | 80%
33 | Gemma 3 12B | 97.3% | $0.0001 | 8.6s | 82%
34 | Llama 3.1 70B | 98.0% | $0.0005 | 23.9s | 91%
35 | Gemini 3 Flash (Preview, Reasoning) | 99.5% | $0.0097 | 17.2s | 98%
36 | GPT-5.1 | 99.5% | $0.013 | 13.7s | 98%
37 | GPT-5 Mini | 99.3% | $0.0052 | 33.1s | 97%
38 | DeepSeek V3 (2024-12-26) | 97.6% | $0.0007 | 15.0s | 80%
39 | DeepSeek V3.2 | 98.9% | $0.0004 | 44.5s | 96%
40 | Claude Opus 4.6 | 99.5% | $0.017 | 5.6s | 98%
41 | Claude Opus 4.5 | 99.3% | $0.017 | 5.2s | 97%
42 | Z.AI GLM 4.6 | 99.7% | $0.0045 | 42.4s | 98%
43 | ByteDance Seed 1.6 | 99.4% | $0.0040 | 43.2s | 98%
44 | ByteDance Seed 1.6 Flash | 95.5% | $0.0007 | 12.2s | 76%
45 | Aion 2.0 | 99.5% | $0.0038 | 48.1s | 97%
46 | GPT-5 Nano | 98.8% | $0.0023 | 50.9s | 96%
47 | Claude 3.5 Sonnet | 99.3% | $0.020 | 9.0s | 97%
48 | Ministral 3 14B | 93.4% | $0.0002 | 3.9s | 70%
49 | Mistral Small Creative | 92.7% | $0.0002 | 2.9s | 71%
50 | Hermes 3 405B | 97.3% | $0.0012 | 22.3s | 74%
51 | Gemini 2.5 Flash Lite (Reasoning) | 96.3% | $0.0018 | 16.2s | 71%
52 | Writer: Palmyra X5 | 94.3% | $0.0034 | 11.1s | 73%
53 | Z.AI GLM 5 | 99.5% | $0.0083 | 60.0s | 97%
54 | o4 Mini | 97.7% | $0.013 | 21.3s | 80%
55 | Grok 4 | 99.6% | $0.023 | 28.7s | 98%
56 | Gemini 2.5 Pro | 99.6% | $0.027 | 19.7s | 98%
57 | WizardLM 2 8x22b | 96.2% | $0.0007 | 37.2s | 70%
58 | o4 Mini High | 99.1% | $0.021 | 33.0s | 96%
59 | Claude Opus 4.6 (Reasoning) | 99.6% | $0.030 | 12.5s | 98%
60 | Minimax M2.5 | 97.7% | $0.0018 | 59.0s | 80%
61 | Arcee AI: Trinity Mini | 88.5% | $0.0002 | 6.5s | 58%
62 | GPT-4.1 Nano | 89.9% | $0.0003 | 3.6s | 52%
63 | Claude Sonnet 4.6 (Reasoning) | 99.6% | $0.032 | 19.3s | 98%
64 | Z.AI GLM 4.7 | 99.7% | $0.0070 | 1.5m | 97%
65 | Ministral 3 8B | 85.7% | $0.0002 | 3.3s | 52%
66 | Mistral NeMO | 85.5% | $0.0002 | 2.5s | 51%
67 | DeepSeek V3 (2025-03-24) | 92.5% | $0.0006 | 35.4s | 58%
68 | GPT-5 | 99.8% | $0.029 | 44.3s | 99%
69 | Ministral 8B | 84.8% | $0.0001 | 3.3s | 50%
70 | Gemini 3 Pro (Preview) | 99.5% | $0.038 | 26.2s | 98%
71 | Z.AI GLM 4.7 Flash | 94.0% | $0.0015 | 1.1m | 68%
72 | Claude Opus 4 | 99.8% | $0.051 | 7.7s | 98%
73 | Arcee AI: Trinity Large (Preview) | 88.0% | $0.0000 | 23.8s | 42%
74 | DeepSeek V3.1 | 88.5% | $0.0006 | 34.3s | 46%
75 | Ministral 3 3B | 78.9% | $0.0001 | 2.0s | 43%
76 | MoonshotAI: Kimi K2.5 | 99.6% | $0.011 | 2.1m | 98%
77 | Ministral 3B | 78.4% | $0.0000 | 2.0s | 39%
78 | Qwen 3.5 397B A17B | 99.4% | $0.0089 | 2.2m | 97%
79 | Gemini 3.1 Pro (Preview) | 98.8% | $0.045 | 42.6s | 87%
80 | Gemma 3 4B | 73.9% | $0.0001 | 6.0s | 24%
81 | Rocinante 12B | 63.8% | $0.0004 | 8.0s | 17%
82 | Cohere Command R+ (Aug. 2024) | 71.4% | $0.0067 | 27.2s | 23%
83 | Claude 3 Haiku | 61.2% | $0.0009 | 4.3s | 4%
84 | Hermes 3 70B | 67.7% | $0.0015 | 2.0m | 10%
Average score: 95.15%

Individual Scenarios

Generic Prompt

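Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 96.4%
Rocinante 12B | 100 | 100 | 100 | 100 | 92 | 83 | 75 | 92.9%
Ministral 3 3B | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 91.7%
Ministral 3B | 100 | 100 | 92 | 92 | 92 | 92 | 58 | 89.3%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 8 | 86.9%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 92 | 8 | 85.7%
Arcee AI: Trinity Mini | 100 | 100 | 92 | 92 | 92 | 83 | 0 | 79.8%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 8 | 0 | 0 | 0 | 0 | 29.8%
Claude 3 Haiku | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6%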
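
The Avg column appears to be the unweighted mean of the seven run scores in a row. The page does not state its formula, so the following is a minimal sketch under that assumption; `row_average` is a hypothetical helper name. Because the displayed run scores are themselves rounded, some rows can differ from the published average by a few tenths of a point.

```python
def row_average(run_scores):
    """Unweighted mean of per-run scores, rounded to one decimal place.

    Assumption: the leaderboard averages runs with equal weight.
    """
    return round(sum(run_scores) / len(run_scores), 1)


# Run scores taken from the DeepSeek V3 (2025-03-24) row: six runs at 100, one at 75.
deepseek_runs = [100, 100, 100, 100, 100, 100, 75]
print(row_average(deepseek_runs))
```

For that row the result agrees with the listed 96.4%.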
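
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
ByteDance Seed 1.6 | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
GPT-5.2 | 100 | 100 | 100 | 98 | 98 | 98 | 96 | 98.7%
GPT-5 Nano | 100 | 100 | 100 | 98 | 98 | 98 | 96 | 98.7%
Z.AI GLM 4.7 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.5%
Claude Opus 4 | 100 | 100 | 100 | 98 | 98 | 96 | 96 | 98.5%
Grok 4 | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.2%
Claude Opus 4.6 (Reasoning) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Claude Opus 4.6 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
GPT-5 | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.2%
Claude Sonnet 4 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Z.AI GLM 4.6 | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 98.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 98 | 98 | 96 | 93 | 98.0%
Claude Haiku 4.5 | 100 | 98 | 98 | 98 | 96 | 96 | 96 | 97.7%
Aion 2.0 | 98 | 98 | 98 | 98 | 98 | 96 | 96 | 97.7%
Minimax M2.5 | 98 | 98 | 98 | 98 | 98 | 96 | 96 | 97.7%
Gemini 2.5 Pro | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.5%
Claude Sonnet 4.6 (Reasoning) | 100 | 98 | 98 | 96 | 96 | 96 | 96 | 97.5%
GPT-5.1 | 98 | 98 | 98 | 98 | 98 | 96 | 95 | 97.5%
MoonshotAI: Kimi K2.5 | 100 | 98 | 96 | 96 | 96 | 96 | 96 | 97.2%
Gemini 3 Pro (Preview) | 98 | 98 | 98 | 96 | 96 | 96 | 96 | 97.2%
Claude Sonnet 4.5 | 98 | 98 | 98 | 96 | 96 | 96 | 96 | 97.2%
DeepSeek V3.1 | 98 | 98 | 96 | 96 | 96 | 96 | 96 | 97.0%
GPT-4.1 Mini | 98 | 98 | 98 | 96 | 96 | 96 | 93 | 96.7%
Hermes 3 405B | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.7%
Gemini 3.1 Pro (Preview) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Qwen 3.5 397B A17B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Claude Sonnet 4.6 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Claude Opus 4.5 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Qwen 3.5 Plus (2026-02-15) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
GPT-4o, May 13th (temp=0) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Claude 3.5 Sonnet | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Claude 3.7 Sonnet | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
GPT-4o, Aug. 6th (temp=0) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
DeepSeek V3.2 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
GPT-4o Mini (temp=1) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
GPT-4o Mini (temp=0) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Gemini 3 Flash (Preview, Reasoning) | 98 | 96 | 96 | 96 | 96 | 96 | 93 | 96.2%
GPT-5 Mini | 98 | 98 | 96 | 96 | 96 | 95 | 93 | 96.2%
Hermes 3 70B | 96 | 96 | 96 | 96 | 96 | 96 | 95 | 96.2%
Z.AI GLM 5 | 96 | 96 | 96 | 96 | 96 | 96 | 93 | 96.0%
Gemma 3 27B | 98 | 96 | 96 | 96 | 95 | 95 | 95 | 96.0%
Gemini 2.5 Flash Lite | 96 | 96 | 96 | 96 | 95 | 95 | 95 | 95.7%
GPT-4.1 | 96 | 96 | 96 | 96 | 95 | 95 | 95 | 95.7%
Z.AI GLM 4.7 Flash | 98 | 96 | 96 | 95 | 95 | 95 | 93 | 95.5%
Qwen 2.5 72B | 96 | 96 | 96 | 95 | 95 | 95 | 95 | 95.5%
Gemini 3 Flash (Preview) | 96 | 96 | 96 | 95 | 95 | 95 | 93 | 95.2%
GPT-4o, May 13th (temp=1) | 98 | 96 | 96 | 95 | 95 | 93 | 93 | 95.2%
WizardLM 2 8x22b | 100 | 98 | 96 | 96 | 93 | 91 | 89 | 95.0%
Gemini 2.5 Flash (Reasoning) | 96 | 96 | 95 | 95 | 95 | 93 | 93 | 94.7%
GPT-4o, Aug. 6th (temp=1) | 98 | 96 | 95 | 95 | 95 | 91 | 91 | 94.5%
o4 Mini High | 98 | 96 | 95 | 95 | 95 | 91 | 89 | 94.2%
Mistral Small 3.2 24B | 95 | 95 | 93 | 93 | 93 | 93 | 93 | 93.5%
GPT-4.1 Nano | 96 | 95 | 95 | 93 | 93 | 91 | 89 | 93.2%
Mistral Large 3 | 93 | 93 | 93 | 93 | 93 | 93 | 93 | 93.0%
Mistral Large 2 | 93 | 93 | 93 | 93 | 93 | 93 | 93 | 93.0%
Mistral Large | 93 | 93 | 93 | 93 | 93 | 93 | 93 | 93.0%
DeepSeek V3 (2024-12-26) | 93 | 93 | 93 | 93 | 93 | 91 | 89 | 92.2%
DeepSeek-V2 Chat | 93 | 93 | 93 | 91 | 91 | 91 | 91 | 92.0%
o4 Mini | 95 | 95 | 93 | 93 | 91 | 89 | 88 | 92.0%
DeepSeek V3 (2025-03-24) | 93 | 93 | 91 | 91 | 91 | 91 | 89 | 91.5%
Mistral Medium 3.1 | 91 | 91 | 88 | 88 | 88 | 88 | 88 | 88.7%
Llama 3.1 70B | 93 | 89 | 84 | 84 | 84 | 82 | 81 | 85.5%
Gemini 2.5 Flash | 98 | 98 | 98 | 98 | 98 | 95 | 11 | 85.2%
Llama 3.1 Nemotron 70B | 86 | 86 | 86 | 86 | 84 | 82 | 82 | 84.7%
Mistral Small Creative | 93 | 82 | 82 | 82 | 82 | 82 | 81 | 83.7%
Gemma 3 12B | 96 | 96 | 95 | 95 | 95 | 95 | 11 | 83.2%
Llama 3.1 8B | 89 | 88 | 88 | 84 | 79 | 77 | 75 | 83.0%
ByteDance Seed 1.6 Flash | 93 | 93 | 88 | 84 | 81 | 81 | 26 | 77.9%
Arcee AI: Trinity Large (Preview) | 96 | 96 | 96 | 96 | 96 | 11 | 11 | 71.9%
Ministral 3 3B | 89 | 84 | 70 | 70 | 67 | 63 | 58 | 71.7%
Mistral NeMO | 77 | 75 | 72 | 70 | 70 | 68 | 67 | 71.4%
Ministral 3B | 89 | 89 | 79 | 63 | 58 | 49 | 47 | 67.9%
Writer: Palmyra X5 | 74 | 67 | 65 | 61 | 61 | 60 | 58 | 63.7%
Ministral 3 14B | 63 | 63 | 63 | 63 | 61 | 58 | 56 | 61.2%
Arcee AI: Trinity Mini | 96 | 82 | 79 | 79 | 77 | 7 | 4 | 60.7%
Ministral 8B | 53 | 51 | 47 | 44 | 44 | 44 | 42 | 46.4%
Ministral 3 8B | 54 | 49 | 46 | 44 | 44 | 42 | 40 | 45.6%
Claude 3 Haiku | 93 | 89 | 0 | 0 | 0 | 0 | 0 | 26.1%
Cohere Command R+ (Aug. 2024) | 89 | 16 | 12 | 4 | 2 | 0 | 0 | 17.5%
Rocinante 12B | 93 | 12 | 9 | 4 | 2 | 0 | 0 | 17.0%
Gemma 3 4B | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 10.5%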
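
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.3%
GPT-5 Nano | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 98 | 98 | 95 | 98.7%
Gemma 3 27B | 100 | 100 | 100 | 100 | 98 | 98 | 95 | 98.7%
Qwen 2.5 72B | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.7%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.3%
GPT-4.1 Mini | 98 | 98 | 98 | 98 | 98 | 95 | 95 | 97.0%
Mistral Medium 3.1 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95.3%
Mistral NeMO | 98 | 95 | 95 | 95 | 95 | 93 | 93 | 95.0%
Llama 3.1 8B | 100 | 95 | 95 | 95 | 91 | 91 | 91 | 94.0%
Gemma 3 12B | 98 | 93 | 93 | 93 | 93 | 93 | 93 | 93.7%
Ministral 3 14B | 95 | 95 | 93 | 93 | 93 | 93 | 93 | 93.7%
GPT-4.1 Nano | 93 | 93 | 93 | 91 | 91 | 91 | 91 | 91.7%
Mistral Small Creative | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 90.7%
Arcee AI: Trinity Mini | 100 | 95 | 95 | 93 | 93 | 93 | 60 | 90.0%
Ministral 3 8B | 93 | 93 | 93 | 93 | 88 | 88 | 72 | 88.7%
Ministral 8B | 95 | 93 | 93 | 93 | 88 | 72 | 67 | 86.0%
Minimax M2.5 | 100 | 100 | 100 | 98 | 98 | 98 | 2 | 85.0%
Arcee AI: Trinity Large (Preview) | 100 | 95 | 53 | 51 | 51 | 51 | 2 | 57.8%
Rocinante 12B | 95 | 86 | 74 | 40 | 21 | 16 | 2 | 47.8%
Gemma 3 4B | 67 | 35 | 35 | 33 | 33 | 30 | 28 | 37.2%
Ministral 3B | 37 | 37 | 37 | 37 | 37 | 28 | 26 | 34.2%
Ministral 3 3B | 37 | 37 | 33 | 33 | 30 | 28 | 28 | 32.2%
Cohere Command R+ (Aug. 2024) | 60 | 37 | 37 | 33 | 28 | 16 | 14 | 32.2%
Claude 3 Haiku | 93 | 93 | 0 | 0 | 0 | 0 | 0 | 26.6%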
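
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 88 | 88 | 96.4%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 96.4%
Gemma 3 27B | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6%
Hermes 3 70B | 100 | 100 | 100 | 100 | 88 | 88 | 88 | 94.6%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 88 | 63 | 92.9%
Gemini 3 Flash (Preview) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1%
GPT-4o Mini (temp=1) | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1%
Qwen 2.5 72B | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 91.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 88 | 38 | 89.3%
GPT-4o Mini (temp=0) | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
Mistral Medium 3.1 | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
Mistral Small Creative | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
Ministral 3 14B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
Arcee AI: Trinity Mini | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
Gemma 3 4B | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
Ministral 3 3B | 63 | 50 | 50 | 50 | 38 | 38 | 38 | 46.4%
Ministral 3B | 88 | 63 | 50 | 38 | 38 | 25 | 25 | 46.4%
Mistral NeMO | 75 | 50 | 38 | 38 | 38 | 25 | 25 | 41.1%
Ministral 3 8B | 38 | 38 | 38 | 38 | 38 | 38 | 25 | 35.7%
Cohere Command R+ (Aug. 2024) | 75 | 50 | 50 | 25 | 25 | 13 | 13 | 35.7%
Rocinante 12B | 88 | 50 | 50 | 25 | 13 | 13 | 0 | 33.9%
Ministral 8B | 38 | 38 | 38 | 38 | 25 | 25 | 13 | 30.4%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%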
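
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.2%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.2%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.2%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.2%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.2%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.2%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 98.8%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 95 | 95 | 98.5%
GPT-5.2 | 100 | 100 | 100 | 100 | 97 | 97 | 95 | 98.5%
GPT-5 Mini | 100 | 100 | 100 | 97 | 97 | 97 | 95 | 98.1%
Ministral 3B | 100 | 100 | 100 | 97 | 97 | 97 | 95 | 98.1%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 97.7%
Minimax M2.5 | 100 | 97 | 97 | 97 | 97 | 97 | 95 | 97.3%
Gemini 2.5 Flash (Reasoning) | 97 | 97 | 97 | 97 | 97 | 97 | 97 | 97.3%
Mistral Large | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 96.9%
DeepSeek-V2 Chat | 100 | 100 | 100 | 95 | 95 | 95 | 95 | 96.9%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 95 | 95 | 95 | 92 | 96.5%
Hermes 3 70B | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 96.1%
DeepSeek V3 (2024-12-26) | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 96.1%
GPT-5 Nano | 100 | 100 | 97 | 97 | 92 | 92 | 92 | 95.8%
ByteDance Seed 1.6 Flash | 97 | 97 | 97 | 97 | 97 | 95 | 89 | 95.8%
Ministral 3 14B | 97 | 97 | 95 | 95 | 95 | 95 | 95 | 95.4%
WizardLM 2 8x22b | 100 | 95 | 95 | 95 | 95 | 95 | 95 | 95.4%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 95 | 95 | 95 | 95 | 92 | 92 | 94.6%
Mistral Large 3 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.6%
Mistral Large 2 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.6%
DeepSeek V3.2 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.6%
GPT-4o Mini (temp=1) | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.6%
Mistral Small 3.2 24B | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.6%
Gemma 3 12B | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.6%
GPT-4o Mini (temp=0) | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.6%
Ministral 3 8B | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.6%
Mistral NeMO | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.6%
Ministral 8B | 95 | 95 | 95 | 95 | 95 | 95 | 92 | 94.2%
Z.AI GLM 4.7 Flash | 100 | 97 | 95 | 95 | 95 | 86 | 86 | 93.4%
Cohere Command R+ (Aug. 2024) | 95 | 95 | 95 | 95 | 95 | 95 | 78 | 92.3%
Llama 3.1 8B | 95 | 95 | 95 | 95 | 95 | 92 | 81 | 92.3%
Gemini 2.5 Flash Lite | 86 | 86 | 86 | 86 | 86 | 86 | 51 | 81.5%
DeepSeek V3.1 | 95 | 95 | 95 | 95 | 95 | 95 | 0 | 81.1%
Rocinante 12B | 100 | 92 | 89 | 86 | 51 | 51 | 0 | 67.2%
Arcee AI: Trinity Large (Preview) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 14.3%
GPT-4.1 Nano | 51 | 3 | 3 | 3 | 3 | 3 | 0 | 9.3%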
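
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 98.9%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 96 | 92 | 98.3%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.3%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 96 | 88 | 97.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 96 | 88 | 97.7%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 96 | 96 | 92 | 97.7%
o4 Mini | 100 | 100 | 100 | 96 | 96 | 92 | 92 | 96.6%
DeepSeek-V2 Chat | 100 | 100 | 100 | 96 | 96 | 92 | 92 | 96.6%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 92 | 84 | 96.6%
Claude Opus 4.6 (Reasoning) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.0%
Claude Opus 4.6 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.0%
Claude Opus 4.5 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 96 | 96 | 92 | 88 | 96.0%
Ministral 3 14B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 96 | 88 | 84 | 95.4%
Hermes 3 70B | 100 | 100 | 100 | 100 | 96 | 96 | 72 | 94.9%
Gemma 3 4B | 96 | 96 | 96 | 96 | 96 | 92 | 92 | 94.9%
Arcee AI: Trinity Mini | 96 | 96 | 92 | 92 | 92 | 92 | 88 | 92.6%
Llama 3.1 70B | 100 | 96 | 96 | 96 | 92 | 84 | 80 | 92.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 92 | 88 | 76 | 72 | 89.7%
GPT-4o, Aug. 6th (temp=0) | 100 | 88 | 88 | 88 | 88 | 88 | 88 | 89.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 20 | 88.6%
Llama 3.1 Nemotron 70B | 96 | 96 | 92 | 88 | 88 | 84 | 64 | 86.9%
Mistral Medium 3.1 | 88 | 88 | 84 | 84 | 84 | 84 | 84 | 85.1%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 92 | 88 | 88 | 16 | 83.4%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 20 | 16 | 0 | 62.3%
Writer: Palmyra X5 | 80 | 76 | 72 | 60 | 52 | 52 | 40 | 61.7%
Ministral 3 8B | 52 | 52 | 48 | 48 | 48 | 44 | 40 | 47.4%
Mistral Small Creative | 76 | 56 | 48 | 40 | 40 | 36 | 32 | 46.9%
Ministral 8B | 52 | 52 | 48 | 44 | 40 | 40 | 36 | 44.6%
Mistral NeMO | 76 | 72 | 64 | 36 | 32 | 4 | 0 | 40.6%
Rocinante 12B | 100 | 68 | 20 | 12 | 12 | 8 | 0 | 31.4%
Ministral 3 3B | 52 | 48 | 48 | 44 | 8 | 8 | 8 | 30.9%
Ministral 3B | 88 | 60 | 16 | 8 | 8 | 8 | 4 | 27.4%
Cohere Command R+ (Aug. 2024) | 92 | 20 | 16 | 8 | 8 | 4 | 4 | 21.7%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%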
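
Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
GPT-5 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 98 | 96 | 99.2%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
GPT-5.1 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.9%
Grok 4 | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.9%
Writer: Palmyra X5 | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
Llama 3.1 70B | 100 | 100 | 100 | 98 | 98 | 98 | 96 | 98.7%
Qwen 2.5 72B | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
Z.AI GLM 4.6 | 100 | 100 | 98 | 98 | 98 | 98 | 96 | 98.4%
Z.AI GLM 5 | 100 | 100 | 98 | 98 | 98 | 98 | 96 | 98.4%
MoonshotAI: Kimi K2.5 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.4%
Claude Opus 4.5 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.4%
Qwen 3.5 Plus (2026-02-15) | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.4%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 98 | 98 | 96 | 96 | 98.4%
Z.AI GLM 4.5 | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Claude Opus 4.6 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Aion 2.0 | 100 | 100 | 100 | 100 | 98 | 96 | 92 | 98.1%
Gemini 3 Pro (Preview) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Claude Opus 4 | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Mistral Large 3 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Claude 3.7 Sonnet | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Hermes 3 405B | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Mistral Large 2 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Mistral Large | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Mistral Small 3.2 24B | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Arcee AI: Trinity Large (Preview) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Mistral Small Creative | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Hermes 3 70B | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
GPT-5 Mini | 100 | 100 | 98 | 98 | 98 | 96 | 94 | 97.8%
Qwen 3.5 397B A17B | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 97.8%
Gemini 3 Flash (Preview, Reasoning) | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 97.8%
Gemini 2.5 Flash Lite | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 97.8%
Gemini 2.5 Pro | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 97.8%
GPT-4.1 Nano | 100 | 98 | 98 | 98 | 98 | 96 | 94 | 97.6%
ByteDance Seed 1.6 | 100 | 98 | 98 | 98 | 98 | 96 | 94 | 97.6%
Grok 4 Fast | 100 | 98 | 98 | 98 | 98 | 96 | 94 | 97.6%
Mistral Medium 3.1 | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
o4 Mini High | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
Grok 4.1 Fast | 100 | 100 | 98 | 98 | 96 | 94 | 94 | 97.3%
Claude 3.5 Sonnet | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
GPT-5 Nano | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
Minimax M2.5 | 100 | 98 | 98 | 96 | 96 | 96 | 94 | 97.0%
GPT-4.1 | 100 | 98 | 96 | 96 | 96 | 96 | 94 | 96.8%
Gemini 2.5 Flash (Reasoning) | 98 | 98 | 98 | 98 | 96 | 94 | 94 | 96.8%
Gemini 2.5 Flash | 98 | 98 | 98 | 96 | 96 | 96 | 94 | 96.8%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 96 | 96 | 96 | 94 | 92 | 96.5%
DeepSeek-V2 Chat | 98 | 98 | 98 | 96 | 96 | 94 | 94 | 96.5%
GPT-4o Mini (temp=1) | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
o4 Mini | 100 | 100 | 96 | 96 | 94 | 94 | 92 | 96.2%
Gemini 3.1 Pro (Preview) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
GPT-5.2 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
GPT-4o Mini (temp=0) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
Ministral 3 14B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
GPT-4.1 Mini | 96 | 96 | 96 | 96 | 96 | 96 | 94 | 96.0%
DeepSeek V3.2 | 98 | 98 | 98 | 96 | 94 | 94 | 92 | 96.0%
WizardLM 2 8x22b | 98 | 96 | 96 | 96 | 96 | 94 | 94 | 96.0%
Gemma 3 12B | 96 | 96 | 96 | 96 | 96 | 94 | 94 | 95.7%
ByteDance Seed 1.6 Flash | 96 | 96 | 96 | 96 | 96 | 94 | 94 | 95.7%
Llama 3.1 8B | 98 | 98 | 98 | 96 | 96 | 92 | 91 | 95.7%
DeepSeek V3 (2024-12-26) | 98 | 96 | 96 | 96 | 94 | 92 | 92 | 95.1%
Ministral 3 8B | 94 | 94 | 94 | 94 | 94 | 94 | 94 | 94.3%
Z.AI GLM 4.7 Flash | 96 | 94 | 94 | 94 | 94 | 94 | 92 | 94.3%
Gemma 3 27B | 98 | 94 | 94 | 94 | 92 | 92 | 91 | 93.8%
Ministral 8B | 96 | 94 | 94 | 94 | 92 | 92 | 92 | 93.8%
Mistral NeMO | 96 | 96 | 94 | 94 | 94 | 92 | 89 | 93.8%
DeepSeek V3 (2025-03-24) | 98 | 98 | 98 | 96 | 96 | 96 | 11 | 84.9%
Arcee AI: Trinity Mini | 100 | 98 | 98 | 98 | 96 | 96 | 0 | 83.8%
Cohere Command R+ (Aug. 2024) | 98 | 92 | 89 | 87 | 75 | 74 | 60 | 82.2%
DeepSeek V3.1 | 94 | 92 | 92 | 92 | 91 | 38 | 6 | 72.2%
Rocinante 12B | 96 | 96 | 96 | 92 | 74 | 8 | 0 | 66.0%
Ministral 3B | 91 | 64 | 55 | 47 | 43 | 42 | 42 | 54.7%
Ministral 3 3B | 74 | 58 | 57 | 55 | 47 | 45 | 42 | 53.9%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%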

Specific Prompt

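Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 98.8%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 96.4%
Arcee AI: Trinity Mini | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 91.7%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Hermes 3 70B | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 71.4%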
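Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.5%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 98 | 96 | 99.2%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 98 | 96 | 99.2%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 99.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 96 | 96 | 99.0%
o4 Mini High | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 99.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 99.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
Grok 4.1 Fast | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
Grok 4 Fast | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.7%
GPT-5.1 | 100 | 100 | 100 | 98 | 98 | 98 | 95 | 98.5%
GPT-4o, May 13th (temp=1) | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.5%
GPT-4.1 Mini | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 98 | 98 | 98 | 96 | 96 | 98.2%
ByteDance Seed 1.6 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
GPT-5.2 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Gemini 3 Pro (Preview) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Mistral Large 3 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
DeepSeek-V2 Chat | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Mistral Large 2 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
DeepSeek V3.2 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Gemini 2.5 Flash Lite | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Mistral Large | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
GPT-4o Mini (temp=1) | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.2%
WizardLM 2 8x22b | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.2%
Claude Haiku 4.5 | 100 | 100 | 98 | 98 | 96 | 96 | 96 | 98.0%
GPT-5 Nano | 100 | 100 | 98 | 98 | 98 | 96 | 95 | 98.0%
Mistral Medium 3.1 | 98 | 98 | 98 | 98 | 98 | 96 | 96 | 97.7%
GPT-4.1 Nano | 100 | 98 | 98 | 98 | 96 | 96 | 96 | 97.7%
Gemma 3 27B | 100 | 98 | 98 | 96 | 96 | 96 | 96 | 97.5%
Claude Sonnet 4.5 | 100 | 98 | 98 | 96 | 96 | 96 | 96 | 97.5%
Writer: Palmyra X5 | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.5%
Llama 3.1 70B | 100 | 98 | 96 | 96 | 96 | 96 | 93 | 96.7%
Arcee AI: Trinity Large (Preview) | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.7%
Minimax M2.5 | 98 | 96 | 96 | 96 | 96 | 96 | 95 | 96.5%
Qwen 3.5 Plus (2026-02-15) | 98 | 96 | 96 | 96 | 96 | 96 | 95 | 96.5%
Claude 3.5 Sonnet | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Claude 3.7 Sonnet | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Hermes 3 405B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Mistral Small 3.2 24B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Qwen 2.5 72B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Llama 3.1 Nemotron 70B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Mistral Small Creative | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Ministral 3 14B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Ministral 3 8B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Llama 3.1 8B | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Claude Sonnet 4.6 | 96 | 96 | 96 | 96 | 96 | 96 | 95 | 96.2%
Mistral NeMO | 96 | 96 | 96 | 96 | 96 | 96 | 95 | 96.2%
Ministral 8B | 96 | 96 | 96 | 96 | 96 | 96 | 93 | 96.0%
Cohere Command R+ (Aug. 2024) | 98 | 96 | 96 | 96 | 95 | 93 | 93 | 95.5%
Gemma 3 12B | 96 | 95 | 95 | 95 | 95 | 95 | 95 | 95.0%
Claude 3 Haiku | 96 | 96 | 95 | 95 | 95 | 95 | 93 | 95.0%
Gemini 3 Flash (Preview) | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 94.7%
Rocinante 12B | 96 | 96 | 95 | 95 | 93 | 93 | 84 | 93.2%
Ministral 3B | 91 | 91 | 91 | 91 | 91 | 91 | 88 | 90.7%
Ministral 3 3B | 91 | 91 | 91 | 91 | 88 | 88 | 88 | 89.7%
Gemma 3 4B | 93 | 89 | 89 | 89 | 89 | 88 | 86 | 89.2%
DeepSeek V3.1 | 98 | 98 | 98 | 98 | 98 | 96 | 16 | 86.2%
ByteDance Seed 1.6 Flash | 98 | 98 | 98 | 96 | 96 | 95 | 18 | 85.7%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 98 | 2 | 85.7%
DeepSeek V3 (2024-12-26) | 98 | 98 | 98 | 98 | 98 | 96 | 0 | 84.0%
DeepSeek V3 (2025-03-24) | 98 | 98 | 98 | 96 | 96 | 96 | 0 | 83.5%
Z.AI GLM 4.7 Flash | 100 | 100 | 96 | 91 | 91 | 25 | 19 | 74.7%
Arcee AI: Trinity Mini | 75 | 75 | 70 | 70 | 70 | 70 | 68 | 71.4%
Hermes 3 70B | 95 | 0 | 0 | 0 | 0 | 0 | 0 | 13.5%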
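Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.3%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 98 | 95 | 99.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 98 | 95 | 99.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 98 | 98 | 95 | 98.7%
GPT-5 Nano | 100 | 100 | 100 | 100 | 98 | 98 | 95 | 98.7%
Grok 4.1 Fast | 100 | 100 | 100 | 98 | 98 | 98 | 95 | 98.3%
Z.AI GLM 4.7 Flash | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.3%
Arcee AI: Trinity Mini | 100 | 100 | 98 | 98 | 98 | 98 | 98 | 98.3%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 98 | 98 | 98 | 95 | 98.3%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.0%
Llama 3.1 8B | 98 | 98 | 98 | 98 | 98 | 95 | 95 | 97.0%
Ministral 3 14B | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 96.7%
GPT-4.1 Nano | 98 | 98 | 98 | 98 | 95 | 95 | 95 | 96.7%
Mistral Small Creative | 95 | 95 | 95 | 95 | 95 | 93 | 93 | 94.7%
Mistral NeMO | 98 | 98 | 98 | 98 | 98 | 88 | 81 | 94.0%
ByteDance Seed 1.6 Flash | 100 | 98 | 95 | 95 | 95 | 88 | 65 | 91.0%
Ministral 3 3B | 95 | 91 | 91 | 91 | 91 | 91 | 88 | 91.0%
Ministral 3B | 91 | 91 | 91 | 91 | 88 | 88 | 88 | 89.7%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 44 | 9 | 7 | 65.8%
Rocinante 12B | 47 | 42 | 28 | 14 | 0 | 0 | 0 | 18.6%
Gemma 3 4B | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16.3%
Hermes 3 70B | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 14.3%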
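Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Rocinante 12B | 100 | 100 | 100 | 100 | 100 | 100 | 88 | 98.2%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 92.9%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 88 | 88 | 88 | 88 | 92.9%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 38 | 91.1%
GPT-4.1 Nano | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
Mistral NeMO | 88 | 88 | 88 | 88 | 88 | 88 | 88 | 87.5%
Hermes 3 70B | 100 | 100 | 100 | 63 | 0 | 0 | 0 | 51.8%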
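Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99.6%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.2%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.2%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 99.2%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 97 | 97 | 99.2%
Mistral Large 2 | 100 | 100 | 100 | 100 | 97 | 97 | 97 | 98.8%
GPT-5 Nano | 100 | 100 | 100 | 97 | 97 | 97 | 97 | 98.5%
Mistral Large | 100 | 100 | 97 | 97 | 97 | 97 | 97 | 98.1%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 97 | 89 | 98.1%
Mistral Large 3 | 97 | 97 | 97 | 97 | 97 | 97 | 97 | 97.3%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 78 | 96.9%
Minimax M2.5 | 97 | 97 | 97 | 97 | 97 | 97 | 95 | 96.9%
Z.AI GLM 4.7 Flash | 97 | 97 | 97 | 97 | 97 | 92 | 86 | 95.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 89 | 89 | 86 | 95.0%
GPT-5.2 | 100 | 100 | 100 | 89 | 89 | 89 | 78 | 92.3%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 92 | 35 | 89.6%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 85.7%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 97 | 97 | 97 | 0 | 84.6%
Rocinante 12B | 100 | 100 | 100 | 100 | 97 | 89 | 0 | 83.8%
Hermes 3 70B | 100 | 100 | 100 | 19 | 0 | 0 | 0 | 45.6%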
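Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
ByteDance Seed 1.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4.1 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Minimax M2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o Mini (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 14B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 4B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral NeMO | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Ministral 3B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
ByteDance Seed 1.6 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 96 | 99.4%
GPT-4.1 Nano | 100 | 100 | 100 | 100 | 100 | 100 | 92 | 98.9%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 100 | 100 | 100 | 100 | 80 | 97.1%
Claude 3 Haiku | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.0%
Arcee AI: Trinity Mini | 96 | 96 | 96 | 96 | 96 | 96 | 92 | 95.4%
Z.AI GLM 4.7 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 48 | 92.6%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 16 | 4 | 74.3%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 16 | 0 | 73.7%
Rocinante 12B | 100 | 100 | 100 | 96 | 20 | 16 | 4 | 62.3%
Hermes 3 70B | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 28.6%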
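Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3.1 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Flash (Preview, Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
o4 Mini High | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Aion 2.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Z.AI GLM 4.7 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Opus 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 3.5 Plus (2026-02-15) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude Haiku 4.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek-V2 Chat | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.5 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, May 13th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3.7 Sonnet | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4.1 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Hermes 3 405B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=1) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
GPT-4o, Aug. 6th (temp=0) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
DeepSeek V3 (2025-03-24) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash Lite | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemini 2.5 Flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Large | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Writer: Palmyra X5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Small 3.2 24B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Gemma 3 12B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Mistral Medium 3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Qwen 2.5 72B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 Nemotron 70B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Claude 3 Haiku | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
WizardLM 2 8x22b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
Llama 3.1 8B | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100.0%
MoonshotAI: Kimi K2.5 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
o4 Mini | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
DeepSeek V3 (2024-12-26) | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Gemma 3 27B | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
Mistral Small Creative | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 99.7%
GPT-5.2 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Z.AI GLM 4.6 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Grok 4 Fast | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Arcee AI: Trinity Large (Preview) | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99.5%
Z.AI GLM 4.5 | 100 | 100 | 100 | 100 | 100 | 98 | 96 | 99.2%
Gemma 3 4B | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
GPT-5 Nano | 100 | 100 | 100 | 100 | 98 | 98 | 98 | 99.2%
Ministral 3B | 100 | 100 | 100 | 100 | 98 | 98 | 96 | 98.9%
Grok 4.1 Fast | 100 | 100 | 100 | 98 | 98 | 98 | 98 | 98.9%
GPT-5.1 | 100 | 100 | 100 | 100 | 98 | 96 | 96 | 98.7%
Minimax M2.5 | 100 | 98 | 98 | 98 | 98 | 98 | 98 | 98.4%
Cohere Command R+ (Aug. 2024) | 100 | 100 | 98 | 98 | 98 | 98 | 96 | 98.4%
Gemini 3 Flash (Preview) | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 98.1%
GPT-4o Mini (temp=1) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
GPT-4o Mini (temp=0) | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
ByteDance Seed 1.6 Flash | 100 | 98 | 98 | 98 | 98 | 98 | 96 | 98.1%
Ministral 3 14B | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98.1%
Grok 4 | 100 | 98 | 98 | 98 | 98 | 96 | 96 | 97.8%
GPT-4.1 Nano | 98 | 98 | 98 | 98 | 98 | 98 | 96 | 97.8%
Ministral 3 3B | 98 | 98 | 98 | 98 | 96 | 96 | 96 | 97.3%
ByteDance Seed 1.6 | 98 | 98 | 98 | 96 | 96 | 96 | 94 | 96.8%
Ministral 3 8B | 100 | 96 | 96 | 96 | 96 | 96 | 96 | 96.8%
Ministral 8B | 98 | 96 | 96 | 96 | 96 | 96 | 96 | 96.5%
Arcee AI: Trinity Mini | 98 | 98 | 96 | 96 | 96 | 96 | 92 | 96.2%
Mistral NeMO | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
DeepSeek V3.1 | 100 | 100 | 100 | 100 | 100 | 100 | 26 | 89.5%
Z.AI GLM 4.7 Flash | 98 | 96 | 96 | 96 | 96 | 91 | 17 | 84.4%
Rocinante 12B | 96 | 96 | 96 | 96 | 96 | 94 | 2 | 82.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 100 | 100 | 98 | 98 | 66 | 0 | 80.3%
Hermes 3 70B | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 42.9%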