Passive → active voice transformations

Test: Text Replacement

Avg. Score: 71.2%
Scenarios: 2

Overall Performance

Rank ▲ | Model | Score | Avg. Cost | Avg. Time | Stability
1 | Gemini 3.5 Flash (Reasoning, Minimal) | 94.5% | $0.0081 | 3.5s | 90%
2 | Qwen 3.5 Plus (2026-02-15) | 90.7% | $0.0022 | 10.2s | 89%
3 | Claude Sonnet 4 | 95.6% | $0.015 | 9.4s | 91%
4 | Claude Sonnet 4.5 | 95.9% | $0.015 | 7.2s | 88%
5 | Gemini 2.5 Flash | 90.9% | $0.0021 | 3.0s | 84%
6 | Gemini 2.5 Flash (Reasoning) | 94.8% | $0.014 | 22.4s | 89%
7 | Grok 4.1 Fast | 93.1% | $0.0017 | 22.5s | 83%
8 | Gemini 2.5 Flash Lite (Reasoning) | 92.9% | $0.0033 | 32.9s | 85%
9 | DeepSeek V4 Pro | 91.2% | $0.0020 | 22.3s | 83%
10 | Gemma 4 31B | 92.0% | $0.0004 | 44.2s | 85%
11 | ByteDance Seed 1.6 | 95.9% | $0.0077 | 1.4m | 94%
12 | GPT-5.4 (Reasoning, Low) | 94.2% | $0.020 | 14.1s | 86%
13 | DeepSeek V3.2 | 92.3% | $0.0007 | 53.2s | 85%
14 | Gemini 3.1 Flash Lite | 88.5% | $0.0013 | 2.4s | 78%
15 | Gemini 3 Flash (Preview) | 89.0% | $0.0027 | 4.3s | 79%
16 | Grok 4.20 (Beta) | 86.5% | $0.0047 | 2.5s | 79%
17 | GPT-5.5 | 92.3% | $0.026 | 6.8s | 86%
18 | GPT-5.5 (Reasoning, Low) | 94.0% | $0.032 | 11.5s | 88%
19 | Claude Opus 4.6 | 90.7% | $0.026 | 8.7s | 87%
20 | GPT-5.4 | 89.0% | $0.013 | 8.3s | 82%
21 | Grok 4.20 (Reasoning) | 95.6% | $0.018 | 1.1m | 89%
22 | GPT-5.1 | 95.3% | $0.029 | 34.1s | 89%
23 | Z.AI GLM 4.5 | 91.2% | $0.0063 | 52.4s | 83%
24 | Claude 3.7 Sonnet | 87.9% | $0.015 | 8.5s | 82%
25 | GPT-5.2 | 93.7% | $0.025 | 22.2s | 84%
26 | Z.AI GLM 5 Turbo | 93.7% | $0.022 | 50.4s | 88%
27 | Gemini 3 Flash (Preview, Reasoning) | 92.9% | $0.022 | 36.9s | 85%
28 | Grok 4.20 (Beta, Reasoning) | 95.1% | $0.040 | 26.6s | 89%
29 | Claude Opus 4.5 | 86.8% | $0.026 | 8.1s | 85%
30 | Gemini 3.1 Flash Lite (Reasoning) | 86.0% | $0.0013 | 8.2s | 71%
31 | Mistral Large 3 | 82.7% | $0.0016 | 10.5s | 74%
32 | Z.AI GLM 4.6 | 90.4% | $0.0078 | 57.8s | 79%
33 | Gemini 3.1 Flash Lite (Preview) | 83.2% | $0.0013 | 2.4s | 70%
34 | GPT-5 Mini | 88.5% | $0.0082 | 50.8s | 79%
35 | Gemma 4 26B | 83.2% | $0.0003 | 27.1s | 75%
36 | Mistral Large | 82.7% | $0.0062 | 10.4s | 74%
37 | Mistral Large 2 | 83.0% | $0.0062 | 10.5s | 74%
38 | Grok 4.3 (Reasoning) | 93.7% | $0.019 | 1.5m | 87%
39 | Grok 4 | 94.5% | $0.039 | 47.6s | 87%
40 | Grok 4.20 | 81.3% | $0.0029 | 6.5s | 70%
41 | DeepSeek V3 (2024-12-26) | 81.0% | $0.0012 | 23.1s | 73%
42 | GPT-5.4 Mini (Reasoning) | 87.6% | $0.018 | 37.8s | 77%
43 | GPT-5.4 (Reasoning) | 96.4% | $0.052 | 47.2s | 90%
44 | Claude Haiku 4.5 | 81.9% | $0.0051 | 5.8s | 69%
45 | Claude 3.5 Sonnet | 86.3% | $0.030 | 13.8s | 78%
46 | GPT-5.5 (Reasoning) | 96.2% | $0.068 | 25.1s | 91%
47 | DeepSeek V4 Flash (Reasoning) | 90.4% | $0.0013 | 2.3m | 82%
48 | Qwen 3.6 Flash | 85.7% | $0.014 | 45.2s | 75%
49 | GPT-5 | 96.4% | $0.050 | 1.3m | 91%
50 | GPT-4.1 Mini | 79.7% | $0.0015 | 10.3s | 66%
51 | Xiaomi MIMO v2.5 Pro | 84.9% | $0.0073 | 32.3s | 68%
52 | Stealth: Healer Alpha | 81.6% | $0.0000 | 24.2s | 65%
53 | Claude Opus 4.7 | 87.1% | $0.036 | 7.3s | 76%
54 | Gemini 3 Pro (Preview) | 96.7% | $0.064 | 43.7s | 90%
55 | Qwen 3.5 Plus (2026-04-20) | 93.1% | $0.020 | 2.1m | 84%
56 | Gemini 2.5 Pro | 94.2% | $0.056 | 41.9s | 85%
57 | Grok 4 Fast | 86.8% | $0.0014 | 11.7s | 55%
58 | GPT-5.4 Mini (Reasoning, Low) | 75.8% | $0.0052 | 7.9s | 67%
59 | Gemini 3.5 Flash (Reasoning) | 97.3% | $0.077 | 33.0s | 91%
60 | Xiaomi MIMO v2.5 | 77.7% | $0.0040 | 15.0s | 61%
61 | Qwen 3.5 397B A17B | 96.2% | $0.011 | 3.7m | 89%
62 | MiniMax M2.5 | 82.4% | $0.0021 | 1.3m | 67%
63 | GPT-5.4 Mini | 70.6% | $0.0040 | 3.2s | 64%
64 | DeepSeek V4 Pro (Reasoning) | 94.5% | $0.013 | 3.5m | 88%
65 | Gemma 4 26B (Reasoning) | 92.0% | $0.0042 | 3.4m | 84%
66 | Stealth: Hunter Alpha | 83.2% | $0.0000 | 29.8s | 54%
67 | Claude Sonnet 4.6 | 79.4% | $0.015 | 7.5s | 61%
68 | Qwen 3.6 35B | 81.3% | $0.011 | 1.0m | 66%
69 | Z.AI GLM 5 | 94.8% | $0.023 | 3.2m | 87%
70 | Z.AI GLM 4.7 | 92.3% | $0.017 | 3.0m | 83%
71 | DeepSeek V4 Flash | 80.5% | $0.0003 | 10.6s | 50%
72 | Qwen 3.5 27B | 95.6% | $0.037 | 2.9m | 88%
73 | Mistral Small 4 | 67.9% | $0.0006 | 4.5s | 57%
74 | Claude Opus 4.7 (Reasoning) | 87.1% | $0.064 | 13.9s | 75%
75 | Llama 3.1 70B | 70.1% | $0.0007 | 18.9s | 55%
76 | Qwen 3.5 35B | 84.6% | $0.025 | 1.3m | 66%
77 | Mistral Medium 3.1 | 73.4% | $0.0018 | 6.3s | 48%
78 | Z.AI GLM 5.1 | 96.2% | $0.044 | 3.4m | 91%
79 | Gemini 2.5 Flash Lite | 70.3% | $0.0004 | 2.5s | 47%
80 | ByteDance Seed 1.6 Flash | 68.4% | $0.0012 | 21.9s | 52%
81 | GPT-4o, Aug. 6th (temp=0) | 77.2% | $0.0091 | 3.9s | 43%
82 | ByteDance Seed 2.0 Lite | 77.7% | $0.0085 | 1.6m | 60%
83 | Aion 2.0 | 87.6% | $0.0069 | 1.5m | 47%
84 | Qwen3 235B A22B Instruct 2507 | 63.7% | $0.0004 | 19.2s | 52%
85 | GPT-4o, May 13th (temp=0) | 76.1% | $0.015 | 4.6s | 43%
86 | GPT-4.1 | 73.6% | $0.0076 | 6.4s | 41%
87 | Qwen3.7 Max | 92.3% | $0.072 | 2.1m | 84%
88 | DeepSeek V3.1 | 75.3% | $0.0009 | 35.4s | 41%
89 | Arcee AI: Trinity Large (Preview) | 65.4% | $0.0000 | 31.5s | 48%
90 | Writer: Palmyra X5 | 64.3% | $0.0050 | 13.7s | 47%
91 | Gemma 4 31B (Reasoning) | 95.6% | $0.0027 | 6.3m | 90%
92 | LFM2 24B | 55.8% | $0.0001 | 15.5s | 50%
93 | Grok 4.3 | 67.0% | $0.0030 | 6.0s | 36%
94 | Qwen 3.5 Flash | 79.1% | $0.0050 | 1.4m | 42%
95 | DeepSeek-V2 Chat | 68.7% | $0.0011 | 21.5s | 34%
96 | Claude Opus 4.6 (Reasoning) | 96.2% | $0.129 | 1.1m | 88%
97 | Mistral Small 4 (Reasoning) | 64.8% | $0.0036 | 34.3s | 40%
98 | GPT-OSS 120B | 67.6% | $0.0010 | 46.5s | 35%
99 | Qwen3.6 Max Preview | 94.2% | $0.066 | 4.0m | 86%
100 | Claude Opus 4 | 87.1% | $0.077 | 13.4s | 51%
101 | Mistral Small Creative | 55.2% | $0.0003 | 4.1s | 37%
102 | DeepSeek V3 (2025-03-24) | 66.2% | $0.0008 | 42.6s | 33%
103 | Z.AI GLM 4.5 Air | 73.9% | $0.0059 | 1.5m | 38%
104 | Claude Sonnet 4.6 (Reasoning) | 92.0% | $0.123 | 1.5m | 85%
105 | Mistral Small 3.2 24B | 54.4% | $0.0003 | 6.9s | 33%
106 | Qwen 3 32B | 59.1% | $0.0010 | 43.4s | 35%
107 | Qwen 3.5 122B | 84.1% | $0.041 | 1.9m | 47%
108 | Qwen 2.5 72B | 45.1% | $0.0004 | 16.3s | 39%
109 | GPT-4o, May 13th (temp=1) | 60.2% | $0.015 | 4.7s | 29%
110 | MoonshotAI: Kimi K2.5 | 93.1% | $0.029 | 6.5m | 84%
111 | Llama 3.1 Nemotron 70B | 48.9% | $0.0021 | 20.8s | 33%
112 | Ministral 3 14B | 45.6% | $0.0003 | 5.7s | 30%
113 | Qwen 3.6 27B | 79.4% | $0.027 | 1.9m | 35%
114 | WizardLM 2 8x22b | 67.0% | $0.0014 | 1.5m | 27%
115 | o4 Mini High | 75.3% | $0.056 | 1.4m | 46%
116 | GPT-4o Mini (temp=1) | 42.6% | $0.0006 | 12.4s | 26%
117 | ByteDance Seed 2.0 Mini | 75.8% | $0.0036 | 4.0m | 43%
118 | MiniMax M2.7 | 69.2% | $0.014 | 3.1m | 40%
119 | GPT-4o, Aug. 6th (temp=1) | 45.9% | $0.0086 | 4.1s | 20%
120 | Nemotron 3 Super | 62.9% | $0.0000 | 2.5m | 26%
121 | Gemini 3.1 Pro (Preview) | 94.2% | $0.168 | 2.6m | 91%
122 | o4 Mini | 58.2% | $0.026 | 39.2s | 21%
123 | Inception Mercury 2 | 42.6% | $0.0027 | 3.7s | 11%
124 | GPT-4o Mini (temp=0) | 36.0% | $0.0006 | 13.8s | 12%
125 | Gemma 3 12B | 38.5% | $0.0001 | 12.8s | 8%
126 | Z.AI GLM 4.7 Flash | 47.3% | $0.0034 | 2.4m | 26%
127 | Claude 3 Haiku | 36.0% | $0.0013 | 6.8s | 7%
128 | MoonshotAI: Kimi K2.6 | 93.1% | $0.072 | 7.6m | 82%
129 | Cydonia 24B V4.1 | 34.3% | $0.0006 | 18.0s | 8%
130 | Hermes 3 405B | 42.6% | $0.0017 | 31.7s | 1%
131 | Skyfall 36B V2 | 23.6% | $0.0010 | 13.0s | 10%
132 | GPT-5.4 Nano (Reasoning, Low) | 28.0% | $0.0014 | 7.4s | 3%
133 | Gemma 3 27B | 30.2% | $0.0003 | 24.0s | 2%
134 | Inception Mercury | 22.3% | $0.0005 | 3.8s | 5%
135 | Ministral 3 8B | 18.4% | $0.0002 | 4.2s | 8%
136 | GPT-5.4 Nano (Reasoning) | 22.8% | $0.0034 | 22.3s | 6%
137 | Qwen 3.5 9B | 47.0% | $0.0020 | 2.9m | 12%
138 | Ministral 8B | 12.9% | $0.0002 | 3.9s | 4%
139 | GPT-4.1 Nano | 10.4% | $0.0004 | 5.0s | 4%
140 | GPT-5.4 Nano | 11.0% | $0.0011 | 3.8s | 4%
141 | Arcee AI: Trinity Mini | 13.2% | $0.0003 | 11.2s | 1%
142 | Nemotron 3 Nano | 40.7% | $0.0034 | 4.0m | 19%
143 | GPT-5 Nano | 23.4% | $0.0052 | 1.9m | 10%
144 | Llama 3.1 8B | 6.6% | $0.0001 | 15.3s | 3%
145 | Gemma 3 4B | 7.7% | $0.0001 | 10.1s | 0%
146 | Mistral NeMO | 5.5% | $0.0002 | 3.1s | 0%
147 | Ministral 3B | 2.7% | $0.0001 | 2.9s | 0%
148 | Cohere Command R+ (Aug. 2024) | 6.9% | $0.0092 | 18.4s | 3%
149 | Ministral 3 3B | 0.8% | $0.0002 | 2.9s | 0%
150 | Rocinante 12B | 0.5% | $0.0005 | 10.0s | 0%
151 | Hermes 3 70B | 21.4% | $0.0022 | 2.8m | 3%
Average score: 71.17%

Individual Scenarios

Generic Prompt

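Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg ▼
GPT-5 | 100 | 100 | 100 | 100 | 96 | 96 | 92 | 97.8%
Grok 4.20 (Reasoning) | 100 | 100 | 100 | 100 | 96 | 96 | 92 | 97.8%
Claude Sonnet 4.5 | 100 | 100 | 100 | 100 | 96 | 96 | 92 | 97.8%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 96 | 96 | 96 | 92 | 97.3%
GPT-5.1 | 100 | 100 | 100 | 96 | 96 | 96 | 92 | 97.3%
Z.AI GLM 5.1 | 100 | 100 | 96 | 96 | 96 | 96 | 88 | 96.2%
Qwen 3.5 27B | 100 | 100 | 96 | 96 | 96 | 92 | 92 | 96.2%
Grok 4.1 Fast | 100 | 100 | 100 | 96 | 92 | 92 | 92 | 96.2%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 96 | 96 | 92 | 88 | 96.2%
GPT-5.5 (Reasoning) | 100 | 100 | 96 | 96 | 96 | 92 | 92 | 96.2%
Qwen 3.5 397B A17B | 100 | 100 | 100 | 96 | 96 | 92 | 88 | 96.2%
Grok 4.20 (Beta, Reasoning) | 100 | 100 | 100 | 96 | 96 | 92 | 88 | 96.2%
GPT-5.4 (Reasoning, Low) | 100 | 100 | 96 | 96 | 92 | 92 | 92 | 95.6%
ByteDance Seed 1.6 | 96 | 96 | 96 | 96 | 96 | 96 | 92 | 95.6%
Gemini 2.5 Flash (Reasoning) | 100 | 100 | 96 | 96 | 96 | 96 | 85 | 95.6%
Z.AI GLM 5 Turbo | 100 | 96 | 96 | 96 | 96 | 92 | 88 | 95.1%
MoonshotAI: Kimi K2.5 | 100 | 96 | 96 | 96 | 96 | 92 | 88 | 95.1%
GPT-5.5 (Reasoning, Low) | 100 | 96 | 96 | 96 | 92 | 92 | 92 | 95.1%
Gemma 4 31B (Reasoning) | 100 | 100 | 96 | 96 | 92 | 92 | 88 | 95.1%
Aion 2.0 | 100 | 100 | 96 | 92 | 92 | 92 | 92 | 95.1%
Grok 4 | 100 | 100 | 100 | 96 | 96 | 92 | 81 | 95.1%
Qwen 3.5 Plus (2026-04-20) | 100 | 100 | 96 | 96 | 92 | 92 | 88 | 95.1%
Grok 4.3 (Reasoning) | 100 | 96 | 96 | 96 | 96 | 96 | 81 | 94.5%
Gemini 3 Pro (Preview) | 100 | 96 | 96 | 92 | 92 | 92 | 92 | 94.5%
Gemini 2.5 Flash Lite (Reasoning) | 100 | 96 | 96 | 96 | 92 | 92 | 88 | 94.5%
Claude Sonnet 4 | 96 | 96 | 96 | 92 | 92 | 92 | 92 | 94.0%
Claude Opus 4.6 (Reasoning) | 100 | 96 | 96 | 92 | 92 | 88 | 88 | 93.4%
DeepSeek V4 Pro (Reasoning) | 96 | 96 | 96 | 92 | 92 | 92 | 88 | 93.4%
Gemini 3.5 Flash (Reasoning, Minimal) | 100 | 96 | 96 | 92 | 92 | 88 | 88 | 93.4%
Gemini 2.5 Flash | 96 | 96 | 96 | 92 | 92 | 92 | 88 | 93.4%
Z.AI GLM 5 | 100 | 100 | 96 | 92 | 92 | 88 | 85 | 93.4%
Qwen3.6 Max Preview | 100 | 100 | 92 | 92 | 88 | 88 | 88 | 92.9%
Gemini 3.1 Pro (Preview) | 96 | 96 | 92 | 92 | 92 | 88 | 88 | 92.3%
GPT-5.5 | 96 | 96 | 92 | 92 | 92 | 92 | 85 | 92.3%
Z.AI GLM 4.7 | 96 | 96 | 92 | 92 | 88 | 88 | 88 | 91.8%
Z.AI GLM 4.5 | 96 | 96 | 96 | 96 | 88 | 85 | 85 | 91.8%
DeepSeek V4 Flash (Reasoning) | 96 | 96 | 92 | 92 | 88 | 88 | 85 | 91.2%
GPT-5.4 | 96 | 92 | 92 | 92 | 92 | 88 | 85 | 91.2%
MoonshotAI: Kimi K2.6 | 100 | 96 | 92 | 88 | 88 | 88 | 85 | 91.2%
GPT-5.2 | 100 | 96 | 96 | 92 | 88 | 88 | 77 | 91.2%
Gemini 3 Flash (Preview, Reasoning) | 96 | 92 | 92 | 92 | 92 | 85 | 85 | 90.7%
Gemini 2.5 Pro | 100 | 96 | 92 | 88 | 88 | 85 | 85 | 90.7%
Gemma 4 31B | 96 | 92 | 92 | 92 | 92 | 88 | 81 | 90.7%
Qwen 3.5 Plus (2026-02-15) | 92 | 92 | 92 | 92 | 88 | 88 | 88 | 90.7%
DeepSeek V4 Pro | 96 | 92 | 92 | 92 | 92 | 85 | 85 | 90.7%
Claude Sonnet 4.6 (Reasoning) | 96 | 92 | 92 | 92 | 88 | 88 | 85 | 90.7%
Gemma 4 26B (Reasoning) | 96 | 92 | 92 | 92 | 88 | 88 | 85 | 90.7%
Z.AI GLM 4.6 | 96 | 96 | 92 | 88 | 88 | 88 | 81 | 90.1%
DeepSeek V3.2 | 96 | 96 | 92 | 92 | 88 | 85 | 81 | 90.1%
Qwen3.7 Max | 92 | 92 | 92 | 92 | 88 | 85 | 85 | 89.6%
Claude Opus 4.6 | 92 | 92 | 92 | 88 | 88 | 85 | 85 | 89.0%
GPT-5 Mini | 92 | 92 | 88 | 88 | 88 | 88 | 77 | 87.9%
Claude 3.7 Sonnet | 92 | 88 | 88 | 88 | 88 | 88 | 77 | 87.4%
Claude Opus 4.5 | 88 | 88 | 88 | 88 | 88 | 85 | 85 | 87.4%
Qwen 3.5 35B | 100 | 96 | 92 | 92 | 88 | 85 | 46 | 85.7%
Gemma 4 26B | 92 | 88 | 85 | 85 | 85 | 81 | 77 | 84.6%
Gemini 3 Flash (Preview) | 88 | 88 | 88 | 85 | 85 | 81 | 77 | 84.6%
Grok 4.20 (Beta) | 96 | 88 | 88 | 85 | 81 | 81 | 73 | 84.6%
GPT-5.4 Mini (Reasoning) | 88 | 88 | 88 | 85 | 81 | 77 | 77 | 83.5%
Claude Opus 4 | 100 | 100 | 96 | 96 | 92 | 92 | 8 | 83.5%
Claude Opus 4.7 (Reasoning) | 92 | 85 | 85 | 85 | 81 | 77 | 73 | 82.4%
Qwen 3.6 Flash | 92 | 88 | 85 | 81 | 81 | 77 | 73 | 82.4%
Gemini 3.1 Flash Lite | 92 | 85 | 81 | 81 | 81 | 77 | 77 | 81.9%
Grok 4 Fast | 96 | 92 | 92 | 92 | 92 | 88 | 15 | 81.3%
MiniMax M2.5 | 96 | 88 | 81 | 81 | 77 | 77 | 69 | 81.3%
Claude Haiku 4.5 | 96 | 81 | 81 | 81 | 77 | 77 | 73 | 80.8%
Claude Opus 4.7 | 92 | 88 | 85 | 85 | 77 | 73 | 65 | 80.8%
Xiaomi MIMO v2.5 | 92 | 92 | 85 | 85 | 77 | 69 | 62 | 80.2%
Claude 3.5 Sonnet | 92 | 81 | 81 | 77 | 77 | 77 | 73 | 79.7%
Xiaomi MIMO v2.5 Pro | 88 | 85 | 85 | 85 | 81 | 69 | 62 | 79.1%
Gemini 3.1 Flash Lite (Preview) | 85 | 81 | 81 | 81 | 77 | 77 | 73 | 79.1%
DeepSeek-V2 Chat | 85 | 81 | 81 | 77 | 77 | 77 | 77 | 79.1%
Gemini 3.1 Flash Lite (Reasoning) | 88 | 85 | 85 | 77 | 73 | 73 | 69 | 78.6%
Qwen 3.5 Flash | 96 | 92 | 92 | 88 | 88 | 85 | 8 | 78.6%
Mistral Large | 85 | 81 | 77 | 77 | 77 | 77 | 77 | 78.6%
Qwen 3.6 35B | 92 | 92 | 85 | 77 | 73 | 73 | 54 | 78.0%
Mistral Large 2 | 81 | 81 | 77 | 77 | 77 | 77 | 77 | 78.0%
Stealth: Hunter Alpha | 96 | 92 | 88 | 88 | 81 | 81 | 15 | 77.5%
Mistral Large 3 | 81 | 77 | 77 | 77 | 77 | 77 | 77 | 77.5%
DeepSeek V3 (2024-12-26) | 81 | 77 | 77 | 77 | 77 | 77 | 73 | 76.9%
Qwen 3.5 122B | 100 | 100 | 96 | 92 | 92 | 35 | 15 | 75.8%
Stealth: Healer Alpha | 92 | 85 | 77 | 69 | 69 | 69 | 65 | 75.3%
ByteDance Seed 2.0 Lite | 85 | 85 | 81 | 77 | 69 | 69 | 62 | 75.3%
Grok 4.20 | 85 | 81 | 77 | 73 | 73 | 69 | 69 | 75.3%
ByteDance Seed 1.6 Flash | 81 | 81 | 81 | 81 | 73 | 65 | 62 | 74.7%
GPT-4.1 Mini | 81 | 81 | 77 | 77 | 77 | 69 | 58 | 74.2%
GPT-5.4 Mini | 77 | 77 | 77 | 73 | 73 | 69 | 69 | 73.6%
DeepSeek V3.1 | 88 | 88 | 85 | 85 | 81 | 77 | 12 | 73.6%
GPT-5.4 Mini (Reasoning, Low) | 81 | 77 | 73 | 73 | 69 | 69 | 62 | 72.0%
DeepSeek V4 Flash | 88 | 85 | 85 | 85 | 85 | 73 | 4 | 72.0%
MiniMax M2.7 | 92 | 81 | 69 | 69 | 69 | 65 | 54 | 71.4%
GPT-OSS 120B | 85 | 81 | 81 | 77 | 73 | 73 | 15 | 69.2%
Mistral Small 4 (Reasoning) | 85 | 77 | 77 | 77 | 77 | 73 | 15 | 68.7%
Qwen 3.6 27B | 96 | 96 | 96 | 92 | 88 | 4 | 4 | 68.1%
Claude Sonnet 4.6 | 73 | 69 | 69 | 65 | 65 | 65 | 65 | 67.6%
Z.AI GLM 4.5 Air | 96 | 96 | 92 | 88 | 81 | 12 | 8 | 67.6%
Mistral Small 4 | 73 | 69 | 69 | 69 | 65 | 62 | 46 | 64.8%
Llama 3.1 70B | 77 | 73 | 73 | 69 | 65 | 50 | 46 | 64.8%
GPT-4o, May 13th (temp=0) | 85 | 77 | 73 | 73 | 73 | 69 | 0 | 64.3%
ByteDance Seed 2.0 Mini | 88 | 85 | 85 | 77 | 77 | 27 | 8 | 63.7%
DeepSeek V3 (2025-03-24) | 77 | 77 | 73 | 73 | 73 | 65 | 8 | 63.7%
GPT-4o, Aug. 6th (temp=0) | 81 | 77 | 77 | 77 | 73 | 58 | 0 | 63.2%
GPT-4.1 | 88 | 85 | 81 | 81 | 77 | 15 | 12 | 62.6%
o4 Mini High | 85 | 85 | 81 | 73 | 50 | 50 | 12 | 62.1%
Gemini 2.5 Flash Lite | 81 | 77 | 77 | 73 | 62 | 54 | 8 | 61.5%
Nemotron 3 Super | 92 | 88 | 73 | 69 | 42 | 42 | 4 | 58.8%
Grok 4.3 | 92 | 85 | 81 | 81 | 50 | 8 | 4 | 57.1%
Mistral Medium 3.1 | 69 | 69 | 62 | 50 | 50 | 46 | 46 | 56.0%
Qwen3 235B A22B Instruct 2507 | 65 | 58 | 58 | 54 | 54 | 50 | 46 | 54.9%
LFM2 24B | 58 | 54 | 54 | 54 | 54 | 54 | 50 | 53.8%
Writer: Palmyra X5 | 62 | 62 | 58 | 54 | 50 | 50 | 31 | 52.2%
Arcee AI: Trinity Large (Preview) | 65 | 62 | 58 | 50 | 42 | 42 | 42 | 51.6%
Qwen 3 32B | 69 | 65 | 62 | 54 | 54 | 46 | 8 | 51.1%
Inception Mercury 2 | 81 | 73 | 73 | 65 | 46 | 4 | 4 | 49.5%
GPT-4o, Aug. 6th (temp=1) | 77 | 65 | 65 | 54 | 54 | 12 | 0 | 46.7%
WizardLM 2 8x22b | 81 | 81 | 77 | 54 | 23 | 0 | 0 | 45.1%
Qwen 2.5 72B | 50 | 50 | 46 | 42 | 38 | 38 | 35 | 42.9%
Qwen 3.5 9B | 92 | 81 | 62 | 23 | 8 | 8 | 0 | 39.0%
GPT-4o, May 13th (temp=1) | 77 | 65 | 54 | 54 | 8 | 8 | 4 | 38.5%
Mistral Small 3.2 24B | 42 | 42 | 42 | 42 | 38 | 38 | 23 | 38.5%
Mistral Small Creative | 50 | 46 | 38 | 38 | 35 | 35 | 27 | 38.5%
Llama 3.1 Nemotron 70B | 54 | 42 | 42 | 35 | 31 | 31 | 31 | 37.9%
Z.AI GLM 4.7 Flash | 73 | 58 | 46 | 31 | 15 | 8 | 0 | 33.0%
GPT-4o Mini (temp=1) | 65 | 65 | 62 | 12 | 4 | 4 | 4 | 30.8%
Ministral 3 8B | 42 | 38 | 35 | 31 | 31 | 19 | 15 | 30.2%
Ministral 3 14B | 54 | 50 | 50 | 12 | 12 | 12 | 8 | 28.0%
Nemotron 3 Nano | 73 | 46 | 23 | 15 | 12 | 8 | 8 | 26.4%
o4 Mini | 65 | 35 | 27 | 15 | 12 | 12 | 8 | 24.7%
Hermes 3 70B | 65 | 50 | 15 | 12 | 8 | 8 | 0 | 22.5%
Ministral 8B | 35 | 27 | 27 | 23 | 23 | 19 | 0 | 22.0%
GPT-5.4 Nano (Reasoning, Low) | 77 | 50 | 4 | 4 | 4 | 0 | 0 | 19.8%
Gemma 3 4B | 31 | 27 | 27 | 15 | 4 | 4 | 0 | 15.4%
Cydonia 24B V4.1 | 35 | 8 | 8 | 8 | 8 | 8 | 4 | 11.0%
GPT-4.1 Nano | 23 | 19 | 19 | 12 | 4 | 0 | 0 | 11.0%
Skyfall 36B V2 | 35 | 19 | 8 | 4 | 4 | 0 | 0 | 9.9%
GPT-5 Nano | 15 | 12 | 8 | 8 | 8 | 8 | 4 | 8.8%
GPT-5.4 Nano (Reasoning) | 12 | 12 | 8 | 8 | 8 | 4 | 4 | 7.7%
Gemma 3 27B | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 7.7%
Inception Mercury | 8 | 8 | 8 | 8 | 8 | 4 | 0 | 6.0%
GPT-5.4 Nano | 12 | 8 | 4 | 4 | 4 | 4 | 4 | 5.5%
Ministral 3B | 23 | 4 | 4 | 4 | 0 | 0 | 0 | 4.9%
Hermes 3 405B | 8 | 8 | 4 | 4 | 4 | 4 | 0 | 4.4%
Llama 3.1 8B | 15 | 4 | 4 | 4 | 0 | 0 | 0 | 3.8%
GPT-4o Mini (temp=0) | 4 | 4 | 4 | 4 | 4 | 0 | 0 | 2.7%
Arcee AI: Trinity Mini | 15 | 4 | 0 | 0 | 0 | 0 | 0 | 2.7%
Cohere Command R+ (Aug. 2024) | 4 | 4 | 4 | 4 | 0 | 0 | 0 | 2.2%
Ministral 3 3B | 4 | 4 | 4 | 0 | 0 | 0 | 0 | 1.6%
Mistral NeMO | 8 | 4 | 0 | 0 | 0 | 0 | 0 | 1.6%
Rocinante 12B | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5%
Gemma 3 12B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Claude 3 Haiku | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%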

Specific Prompt

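Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | Avg ▼
Claude Opus 4.6 (Reasoning) | 100 | 100 | 100 | 100 | 100 | 96 | 96 | 98.9%
Gemini 3 Pro (Preview) | 100 | 100 | 100 | 100 | 100 | 96 | 96 | 98.9%
Gemini 3.5 Flash (Reasoning) | 100 | 100 | 100 | 100 | 96 | 96 | 96 | 98.4%
Gemini 2.5 Pro | 100 | 100 | 100 | 100 | 96 | 96 | 92 | 97.8%
Claude Sonnet 4 | 100 | 100 | 100 | 96 | 96 | 96 | 92 | 97.3%
Gemma 4 31B (Reasoning) | 100 | 96 | 96 | 96 | 96 | 96 | 92 | 96.2%
Z.AI GLM 5 | 100 | 100 | 96 | 96 | 96 | 92 | 92 | 96.2%
GPT-5.2 | 100 | 100 | 96 | 96 | 96 | 96 | 88 | 96.2%
Gemini 3.1 Pro (Preview) | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
Z.AI GLM 5.1 | 100 | 96 | 96 | 96 | 96 | 96 | 92 | 96.2%
GPT-5.5 (Reasoning) | 100 | 96 | 96 | 96 | 96 | 96 | 92 | 96.2%
Qwen 3.5 397B A17B | 100 | 100 | 96 | 96 | 96 | 92 | 92 | 96.2%
ByteDance Seed 1.6 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96.2%
Qwen3.6 Max Preview | 100 | 100 | 96 | 96 | 96 | 92 | 88 | 95.6%
DeepSeek V4 Pro (Reasoning) | 100 | 100 | 100 | 96 | 96 | 88 | 88 | 95.6%
GPT-5.4 (Reasoning) | 100 | 100 | 100 | 92 | 92 | 92 | 92 | 95.6%
Gemini 3.5 Flash (Reasoning, Minimal) | 96 | 96 | 96 | 96 | 96 | 96 | 92 | 95.6%
Qwen3.7 Max | 100 | 96 | 96 | 96 | 96 | 92 | 88 | 95.1%
GPT-5 | 96 | 96 | 96 | 96 | 96 | 92 | 92 | 95.1%
Qwen 3.5 27B | 100 | 100 | 96 | 96 | 96 | 92 | 85 | 95.1%
Gemini 3 Flash (Preview, Reasoning) | 96 | 96 | 96 | 96 | 96 | 92 | 92 | 95.1%
Gemini 3.1 Flash Lite | 96 | 96 | 96 | 96 | 96 | 92 | 92 | 95.1%
MoonshotAI: Kimi K2.6 | 100 | 100 | 100 | 100 | 96 | 88 | 81 | 95.1%
DeepSeek V3.2 | 96 | 96 | 96 | 96 | 96 | 92 | 88 | 94.5%
Grok 4.20 (Beta, Reasoning) | 96 | 96 | 96 | 96 | 92 | 92 | 88 | 94.0%
Claude Sonnet 4.5 | 100 | 96 | 96 | 96 | 92 | 92 | 85 | 94.0%
Gemini 2.5 Flash (Reasoning) | 96 | 96 | 96 | 92 | 92 | 92 | 92 | 94.0%
Grok 4 | 96 | 96 | 96 | 96 | 92 | 92 | 88 | 94.0%
GPT-5.1 | 100 | 96 | 96 | 92 | 92 | 88 | 88 | 93.4%
Gemma 4 26B (Reasoning) | 96 | 96 | 96 | 96 | 96 | 88 | 85 | 93.4%
Grok 4.20 (Reasoning) | 96 | 96 | 96 | 92 | 92 | 92 | 88 | 93.4%
Claude Opus 4.7 | 96 | 96 | 92 | 92 | 92 | 92 | 92 | 93.4%
Gemma 4 31B | 96 | 96 | 92 | 92 | 92 | 92 | 92 | 93.4%
Gemini 3.1 Flash Lite (Reasoning) | 96 | 96 | 96 | 96 | 92 | 92 | 85 | 93.4%
Gemini 3 Flash (Preview) | 96 | 96 | 96 | 92 | 92 | 92 | 88 | 93.4%
Claude Sonnet 4.6 (Reasoning) | 100 | 96 | 96 | 92 | 92 | 88 | 88 | 93.4%
GPT-5.4 (Reasoning, Low) | 96 | 96 | 92 | 92 | 92 | 92 | 88 | 92.9%
Claude 3.5 Sonnet | 96 | 92 | 92 | 92 | 92 | 92 | 92 | 92.9%
Grok 4.3 (Reasoning) | 96 | 96 | 96 | 92 | 92 | 88 | 88 | 92.9%
GPT-5.5 (Reasoning, Low) | 96 | 96 | 96 | 92 | 92 | 88 | 88 | 92.9%
Z.AI GLM 4.7 | 100 | 96 | 96 | 96 | 92 | 88 | 81 | 92.9%
Claude Opus 4.6 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92.3%
GPT-5.5 | 96 | 96 | 92 | 92 | 92 | 88 | 88 | 92.3%
Grok 4 Fast | 96 | 96 | 92 | 92 | 92 | 88 | 88 | 92.3%
Z.AI GLM 5 Turbo | 96 | 96 | 92 | 92 | 92 | 88 | 88 | 92.3%
Qwen 3.5 122B | 96 | 92 | 92 | 92 | 92 | 92 | 88 | 92.3%
Claude Opus 4.7 (Reasoning) | 96 | 96 | 92 | 92 | 92 | 88 | 85 | 91.8%
GPT-5.4 Mini (Reasoning) | 96 | 96 | 96 | 92 | 92 | 85 | 85 | 91.8%
o4 Mini | 96 | 96 | 92 | 92 | 92 | 88 | 85 | 91.8%
DeepSeek V4 Pro | 100 | 96 | 96 | 92 | 88 | 85 | 85 | 91.8%
Qwen 3.5 Plus (2026-04-20) | 96 | 96 | 92 | 92 | 88 | 88 | 85 | 91.2%
Claude Sonnet 4.6 | 92 | 92 | 92 | 92 | 92 | 88 | 88 | 91.2%
MoonshotAI: Kimi K2.5 | 100 | 96 | 96 | 92 | 92 | 88 | 73 | 91.2%
Gemini 2.5 Flash Lite (Reasoning) | 96 | 96 | 92 | 92 | 88 | 88 | 85 | 91.2%
GPT-4o, Aug. 6th (temp=0) | 96 | 96 | 96 | 96 | 85 | 85 | 85 | 91.2%
Qwen 3.6 27B | 96 | 96 | 96 | 92 | 88 | 85 | 81 | 90.7%
Z.AI GLM 4.6 | 100 | 96 | 92 | 88 | 88 | 88 | 81 | 90.7%
Claude Opus 4 | 96 | 92 | 92 | 92 | 88 | 88 | 85 | 90.7%
Z.AI GLM 4.5 | 96 | 92 | 92 | 92 | 92 | 88 | 81 | 90.7%
Qwen 3.5 Plus (2026-02-15) | 92 | 92 | 92 | 92 | 88 | 88 | 88 | 90.7%
Mistral Medium 3.1 | 92 | 92 | 92 | 92 | 92 | 88 | 85 | 90.7%
Xiaomi MIMO v2.5 Pro | 100 | 96 | 96 | 88 | 88 | 85 | 81 | 90.7%
Grok 4.1 Fast | 96 | 96 | 92 | 92 | 85 | 85 | 85 | 90.1%
DeepSeek V4 Flash (Reasoning) | 96 | 96 | 92 | 92 | 88 | 85 | 77 | 89.6%
DeepSeek V4 Flash | 92 | 88 | 88 | 88 | 88 | 88 | 88 | 89.0%
GPT-5 Mini | 96 | 92 | 92 | 92 | 88 | 85 | 77 | 89.0%
Stealth: Hunter Alpha | 96 | 92 | 88 | 88 | 88 | 85 | 85 | 89.0%
WizardLM 2 8x22b | 100 | 92 | 92 | 88 | 88 | 85 | 77 | 89.0%
Qwen 3.6 Flash | 96 | 92 | 92 | 88 | 88 | 85 | 81 | 89.0%
o4 Mini High | 96 | 96 | 92 | 88 | 88 | 85 | 73 | 88.5%
Grok 4.20 (Beta) | 92 | 88 | 88 | 88 | 88 | 88 | 85 | 88.5%
Gemini 2.5 Flash | 92 | 92 | 92 | 88 | 88 | 85 | 81 | 88.5%
Claude 3.7 Sonnet | 92 | 88 | 88 | 88 | 88 | 88 | 85 | 88.5%
ByteDance Seed 2.0 Mini | 92 | 88 | 88 | 88 | 88 | 85 | 85 | 87.9%
Mistral Large 3 | 88 | 88 | 88 | 88 | 88 | 88 | 85 | 87.9%
Mistral Large 2 | 92 | 88 | 88 | 88 | 88 | 85 | 85 | 87.9%
Stealth: Healer Alpha | 96 | 92 | 92 | 92 | 88 | 81 | 73 | 87.9%
GPT-4o, May 13th (temp=0) | 96 | 92 | 92 | 92 | 85 | 81 | 77 | 87.9%
Gemini 3.1 Flash Lite (Preview) | 96 | 96 | 88 | 85 | 85 | 81 | 81 | 87.4%
Grok 4.20 | 88 | 88 | 88 | 88 | 88 | 88 | 81 | 87.4%
GPT-5.4 | 92 | 88 | 88 | 85 | 85 | 85 | 85 | 86.8%
Mistral Large | 92 | 88 | 88 | 88 | 85 | 85 | 81 | 86.8%
Claude Opus 4.5 | 88 | 88 | 88 | 85 | 85 | 85 | 85 | 86.3%
GPT-4.1 Mini | 96 | 92 | 85 | 85 | 81 | 81 | 77 | 85.2%
DeepSeek V3 (2024-12-26) | 88 | 88 | 85 | 85 | 85 | 85 | 81 | 85.2%
Qwen 3.6 35B | 92 | 88 | 85 | 85 | 81 | 81 | 81 | 84.6%
GPT-4.1 | 96 | 88 | 85 | 85 | 85 | 81 | 73 | 84.6%
MiniMax M2.5 | 96 | 92 | 88 | 81 | 81 | 73 | 73 | 83.5%
Qwen 3.5 35B | 88 | 88 | 88 | 85 | 85 | 77 | 73 | 83.5%
Claude Haiku 4.5 | 96 | 92 | 85 | 81 | 77 | 77 | 73 | 83.0%
GPT-4o, May 13th (temp=1) | 88 | 85 | 85 | 85 | 81 | 81 | 69 | 81.9%
Gemma 4 26B | 88 | 88 | 88 | 81 | 81 | 77 | 69 | 81.9%
Hermes 3 405B | 100 | 100 | 96 | 92 | 88 | 85 | 4 | 80.8%
Aion 2.0 | 96 | 96 | 92 | 92 | 92 | 92 | 0 | 80.2%
Z.AI GLM 4.5 Air | 92 | 92 | 92 | 88 | 69 | 65 | 62 | 80.2%
ByteDance Seed 2.0 Lite | 92 | 88 | 88 | 88 | 85 | 85 | 35 | 80.2%
Qwen 3.5 Flash | 96 | 92 | 92 | 88 | 85 | 81 | 23 | 79.7%
GPT-5.4 Mini (Reasoning, Low) | 85 | 85 | 81 | 81 | 81 | 77 | 69 | 79.7%
Gemini 2.5 Flash Lite | 88 | 81 | 81 | 77 | 77 | 77 | 73 | 79.1%
Arcee AI: Trinity Large (Preview) | 85 | 85 | 85 | 81 | 73 | 73 | 73 | 79.1%
DeepSeek V3.1 | 92 | 92 | 88 | 88 | 85 | 77 | 15 | 76.9%
Grok 4.3 | 88 | 85 | 81 | 77 | 73 | 73 | 62 | 76.9%
Gemma 3 12B | 81 | 81 | 81 | 81 | 73 | 73 | 69 | 76.9%
Writer: Palmyra X5 | 85 | 81 | 77 | 77 | 73 | 73 | 69 | 76.4%
Llama 3.1 70B | 88 | 85 | 81 | 73 | 69 | 69 | 62 | 75.3%
Xiaomi MIMO v2.5 | 92 | 88 | 81 | 81 | 69 | 62 | 54 | 75.3%
Qwen3 235B A22B Instruct 2507 | 77 | 77 | 73 | 73 | 73 | 69 | 65 | 72.5%
Mistral Small Creative | 77 | 73 | 73 | 73 | 73 | 69 | 65 | 72.0%
Claude 3 Haiku | 81 | 77 | 77 | 73 | 73 | 69 | 54 | 72.0%
Mistral Small 4 | 85 | 81 | 69 | 69 | 65 | 65 | 62 | 70.9%
Mistral Small 3.2 24B | 85 | 85 | 77 | 62 | 62 | 62 | 62 | 70.3%
GPT-4o Mini (temp=0) | 73 | 73 | 69 | 69 | 69 | 65 | 65 | 69.2%
DeepSeek V3 (2025-03-24) | 92 | 88 | 85 | 85 | 85 | 46 | 0 | 68.7%
GPT-5.4 Mini | 73 | 73 | 69 | 69 | 65 | 62 | 62 | 67.6%
MiniMax M2.7 | 92 | 88 | 85 | 81 | 81 | 38 | 4 | 67.0%
Nemotron 3 Super | 96 | 96 | 92 | 85 | 69 | 15 | 15 | 67.0%
Qwen 3 32B | 88 | 81 | 77 | 77 | 69 | 65 | 12 | 67.0%
GPT-OSS 120B | 92 | 88 | 88 | 85 | 77 | 15 | 15 | 65.9%
Ministral 3 14B | 69 | 65 | 65 | 62 | 62 | 62 | 58 | 63.2%
ByteDance Seed 1.6 Flash | 81 | 73 | 65 | 65 | 65 | 42 | 42 | 62.1%
Z.AI GLM 4.7 Flash | 81 | 77 | 73 | 69 | 58 | 58 | 15 | 61.5%
Mistral Small 4 (Reasoning) | 73 | 73 | 73 | 73 | 73 | 54 | 8 | 61.0%
Llama 3.1 Nemotron 70B | 73 | 69 | 65 | 62 | 58 | 50 | 42 | 59.9%
DeepSeek-V2 Chat | 85 | 85 | 81 | 81 | 77 | 0 | 0 | 58.2%
Cydonia 24B V4.1 | 85 | 77 | 73 | 65 | 65 | 31 | 8 | 57.7%
LFM2 24B | 65 | 62 | 58 | 58 | 54 | 54 | 54 | 57.7%
Qwen 3.5 9B | 96 | 81 | 81 | 77 | 19 | 15 | 15 | 54.9%
Nemotron 3 Nano | 73 | 73 | 69 | 62 | 62 | 35 | 12 | 54.9%
GPT-4o Mini (temp=1) | 69 | 65 | 65 | 62 | 62 | 58 | 0 | 54.4%
Gemma 3 27B | 92 | 88 | 81 | 77 | 15 | 8 | 8 | 52.7%
Qwen 2.5 72B | 62 | 50 | 46 | 46 | 42 | 42 | 42 | 47.3%
GPT-4o, Aug. 6th (temp=1) | 88 | 85 | 73 | 65 | 4 | 0 | 0 | 45.1%
Inception Mercury | 73 | 73 | 42 | 42 | 15 | 12 | 12 | 38.5%
GPT-5 Nano | 54 | 50 | 46 | 38 | 35 | 27 | 15 | 37.9%
GPT-5.4 Nano (Reasoning) | 77 | 62 | 35 | 31 | 31 | 27 | 4 | 37.9%
Skyfall 36B V2 | 69 | 46 | 42 | 42 | 38 | 15 | 8 | 37.4%
GPT-5.4 Nano (Reasoning, Low) | 81 | 77 | 65 | 12 | 8 | 8 | 4 | 36.3%
Inception Mercury 2 | 88 | 85 | 15 | 15 | 15 | 15 | 15 | 35.7%
Arcee AI: Trinity Mini | 62 | 31 | 27 | 23 | 23 | 0 | 0 | 23.6%
Hermes 3 70B | 73 | 69 | 0 | 0 | 0 | 0 | 0 | 20.3%
GPT-5.4 Nano | 62 | 19 | 15 | 8 | 8 | 4 | 0 | 16.5%
Cohere Command R+ (Aug. 2024) | 23 | 19 | 15 | 12 | 4 | 4 | 4 | 11.5%
GPT-4.1 Nano | 38 | 19 | 8 | 4 | 0 | 0 | 0 | 9.9%
Mistral NeMO | 42 | 23 | 0 | 0 | 0 | 0 | 0 | 9.3%
Llama 3.1 8B | 31 | 15 | 12 | 8 | 0 | 0 | 0 | 9.3%
Ministral 3 8B | 8 | 8 | 8 | 8 | 8 | 4 | 4 | 6.6%
Ministral 8B | 8 | 4 | 4 | 4 | 4 | 4 | 0 | 3.8%
Ministral 3B | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5%
Rocinante 12B | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5%
Gemma 3 4B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%
Ministral 3 3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0%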