Anthropic
Comparing 13 models from Anthropic, sorted by total score in descending order.
| Model | Total | Released | Context | CoT | Tooling | Creative Writing | Language | Utility | Reasoning | Text Editing | Rule Following | Hallucination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Reasoning) | 95.02% | Feb 4, 2026 | 1m | ✓ | 100.00% | 84.55% | 96.12% | 98.93% | 93.77% | 98.86% | 89.78% | 98.13% |
| Claude Sonnet 4.6 (Reasoning) | 93.66% | Feb 17, 2026 | 1m | ✓ | 100.00% | 83.09% | 97.58% | 97.88% | 92.76% | 98.30% | 85.73% | 93.96% |
| Claude Opus 4.6 | 92.35% | Feb 4, 2026 | 1m | – | 100.00% | 83.59% | 96.13% | 90.72% | 93.33% | 98.35% | 83.11% | 93.60% |
| Claude Sonnet 4.6 | 91.15% | Feb 17, 2026 | 1m | – | 100.00% | 83.31% | 100.00% | 88.52% | 88.48% | 96.37% | 82.50% | 89.99% |
| Claude Opus 4.5 | 89.69% | Nov 24, 2025 | 200k | – | 100.00% | 81.71% | 99.66% | 89.84% | 93.93% | 97.69% | 72.61% | 82.06% |
| Claude Sonnet 4 | 88.72% | May 22, 2025 | 200k | – | 100.00% | 79.21% | 91.31% | 84.02% | 94.48% | 99.13% | 81.52% | 80.08% |
| Claude Sonnet 4.5 | 88.03% | Sep 29, 2025 | 1m | – | 100.00% | 84.19% | 92.39% | 83.78% | 92.50% | 99.02% | 76.80% | 75.57% |
| Claude Opus 4 | 87.69% | May 22, 2025 | 200k | – | 100.00% | 83.79% | 93.01% | 88.81% | 92.59% | 97.25% | 70.37% | 75.68% |
| Claude Haiku 4.5 | 85.14% | Oct 15, 2025 | 200k | – | 99.10% | 78.96% | 91.84% | 72.48% | 87.76% | 96.81% | 70.35% | 83.86% |
| Claude 3.5 Sonnet | 84.24% | Jun 20, 2024 | 200k | – | 100.00% | 78.69% | 85.62% | 76.75% | 90.30% | 96.57% | 69.67% | 76.31% |
| Claude 3.5 Haiku | 83.73% | Oct 22, 2024 | 200k | – | 99.69% | 75.28% | 82.12% | 82.57% | 82.23% | – | 64.18% | 100.00% |
| Claude 3.7 Sonnet | 83.39% | Feb 19, 2025 | 200k | – | 99.32% | 76.31% | 92.95% | 62.54% | 89.94% | 97.12% | 73.78% | 75.18% |
| Claude 3 Haiku | 71.19% | Mar 13, 2024 | 200k | – | 99.47% | 74.53% | 72.76% | 68.47% | 77.94% | 64.36% | 51.15% | 60.81% |
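
The page does not define how Total is computed, but the figures are consistent with an unweighted mean of the eight category scores. The Python sketch below checks that reading against two rows copied from the table. The helper name `mean_category_score` is illustrative, and the averaging rule is an inference from the numbers, not a documented formula.

```python
from statistics import mean

# Category scores copied from the table, in column order:
# Tooling, Creative Writing, Language, Utility, Reasoning,
# Text Editing, Rule Following, Hallucination.
CATEGORY_SCORES = {
    "Claude Opus 4.6 (Reasoning)": [100.00, 84.55, 96.12, 98.93, 93.77, 98.86, 89.78, 98.13],
    "Claude 3 Haiku": [99.47, 74.53, 72.76, 68.47, 77.94, 64.36, 51.15, 60.81],
}


def mean_category_score(scores):
    """Unweighted mean of the available category scores, rounded to 2 places.

    Assumption: Total is a plain average; missing cells (None) are skipped.
    """
    available = [s for s in scores if s is not None]
    return round(mean(available), 2)


for model, scores in CATEGORY_SCORES.items():
    print(model, mean_category_score(scores))
```

Both rows reproduce their listed totals (95.02% and 71.19%). For Claude 3.5 Haiku, which has no Text Editing score, averaging the seven remaining values gives 83.72, which is 0.01 below the listed 83.73%. That gap may come from rounding in the displayed category values, so treat the rule as approximate.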
Model Performance
Cost vs Performance
Compares total benchmark cost against overall score for Anthropic models. Quadrant lines are drawn at the median cost and the median score.
One low-scoring outlier, Claude 3 Haiku (71.2%), is hidden from the chart.
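
As a rough illustration of how median-based quadrants work, the Python sketch below computes the median of the Total scores from the table and labels a point relative to the two median lines. Cost figures are not included on this page, so `quadrant` takes the cost median as a parameter. The function name, the labels, and the tie-breaking rules are assumptions made for illustration. The page also does not say whether the hidden outlier counts toward the median; this sketch includes all 13 models.

```python
from statistics import median

# Total scores (%) for the 13 models, copied from the table above.
TOTAL_SCORES = [95.02, 93.66, 92.35, 91.15, 89.69, 88.72, 88.03,
                87.69, 85.14, 84.24, 83.73, 83.39, 71.19]


def quadrant(cost, score, cost_median, score_median):
    """Label a point relative to the median cost and median score lines.

    Assumption: points exactly on a line count as low cost / high score.
    """
    cost_side = "low cost" if cost <= cost_median else "high cost"
    score_side = "high score" if score >= score_median else "low score"
    return f"{cost_side}, {score_side}"


score_median = median(TOTAL_SCORES)
print(score_median)  # 88.03
```

With placeholder costs (hypothetical values, not taken from the page), `quadrant(1.0, 95.02, cost_median=2.0, score_median=score_median)` returns `"low cost, high score"`. If Claude 3 Haiku were excluded, the score median would instead fall halfway between the two middle values of the remaining twelve, 88.72 and 88.03.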
Cost Breakdown
Total benchmark cost per model, broken down by input, reasoning, and output tokens, shown in either USD or token counts.