Benchmark Results
Score vs Cost
Leaderboard
| # | Model | OW | Provider | Score | Std | DropD | HSD | DropDb | Cost | Last Tested |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | | Google | 100.0% | 100% | 100% | 100% | 100% | $1.296 | Apr 11, 2026 |
| 2 | DeepSeek V3.2 Speciale | OW | DeepSeek | 99.6% | 99% | 100% | 100% | 100% | $0.423 | Mar 10, 2026 |
| 3 | DeepSeek V3.2 Speciale (Reasoning) | OW | DeepSeek | 99.6% | 99% | 100% | 100% | 100% | $0.437 | Mar 10, 2026 |
| 4 | Qwen 3.5 Plus | OW | Alibaba | 99.6% | 99% | 100% | 100% | 100% | $0.605 | Mar 9, 2026 |
| 5 | Kimi K2.5 (Reasoning) | OW | Moonshot | 99.6% | 99% | 100% | 100% | 100% | $0.625 | Mar 10, 2026 |
| 6 | Kimi K2.5 | OW | Moonshot | 99.6% | 99% | 100% | 100% | 100% | $0.649 | Mar 10, 2026 |
| 7 | Qwen 3.5 Flash | OW | Alibaba | 98.3% | 99% | 100% | 99% | 94% | $0.231 | Mar 9, 2026 |
| 8 | Claude Opus 4.7 | | Anthropic | 87.2% | 87% | 88% | 89% | 83% | $2.378 | Apr 19, 2026 |
| 9 | MiniMax M2.5 | OW | MiniMax | 84.9% | 85% | 80% | 87% | 88% | $0.157 | Mar 9, 2026 |
| 10 | MiniMax M2.5 (Reasoning) | OW | MiniMax | 82.8% | 85% | 78% | 81% | 88% | $0.159 | Mar 9, 2026 |
| 11 | GPT-5.4 | | OpenAI | 63.2% | 74% | 69% | 49% | 58% | $0.319 | Mar 9, 2026 |
| 12 | Claude Opus 4.6 | | Anthropic | 59.8% | 69% | 51% | 54% | 58% | $0.747 | Mar 9, 2026 |
| 13 | Gemma 4 31B | OW | Google | 51.7% | 74% | 58% | 43% | 28% | $0.034 | Apr 19, 2026 |
| 14 | Gemini 3.1 Flash Lite | | Google | 38.5% | 65% | 53% | 10% | 6% | $0.031 | Mar 9, 2026 |
| 15 | Gemma 4 26B A4B | OW | Google | 36.2% | 59% | 52% | 19% | 5% | $0.025 | Apr 19, 2026 |
| 16 | Llama 3.3 70B | OW | Meta | 24.7% | 20% | 13% | 39% | 24% | $0.031 | Mar 9, 2026 |
| 17 | Mistral Large | OW | Mistral | 24.7% | 30% | 31% | 24% | 3% | $0.053 | Mar 9, 2026 |
| 18 | Claude Sonnet 4.6 | | Anthropic | 20.5% | 30% | 20% | 16% | 6% | $0.590 | Mar 9, 2026 |
| 19 | DeepSeek V3.2 | OW | DeepSeek | 20.1% | 21% | 33% | 16% | 9% | $0.029 | Mar 9, 2026 |
| 20 | Claude Haiku 4.5 | | Anthropic | 17.6% | 21% | 27% | 10% | 12% | $0.168 | Mar 9, 2026 |
| 21 | Llama 4 Scout | OW | Meta | 13.8% | 19% | 13% | 11% | 6% | $0.016 | Mar 9, 2026 |
Tuning Difficulty
Drop Db
Half-Step Down
Drop D
Standard
Hardest Questions
| ID | Tuning | Success Rate | Attempts |
|---|---|---|---|
| FB_172 | Drop D | 33.3% | 24 |
| FB_209 | Standard | 36.4% | 22 |
| FB_148 | Standard | 37.5% | 24 |
| FB_169 | Standard | 37.5% | 24 |
| FB_202 | Standard | 40.9% | 22 |
| FB_204 | Half-Step Down | 40.9% | 22 |
| FB_225 | Drop D | 40.9% | 22 |
| FB_226 | Standard | 40.9% | 22 |
| FB_228 | Standard | 40.9% | 22 |
| FB_001 | Half-Step Down | 41.7% | 24 |
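The success rates above are consistent with whole-number success counts over the listed attempts. The page does not define the metric, so the sketch below assumes that success rate means successes divided by attempts. Under that assumption it recovers the implied count for a few rows and checks that it rounds back to the published figure. The helper name `implied_successes` is hypothetical.

```python
def implied_successes(rate_percent: float, attempts: int) -> int:
    """Nearest whole number of successes for a published rate (assumed = successes / attempts)."""
    return round(rate_percent / 100 * attempts)


# (ID, success rate in percent, attempts), copied from the Hardest Questions table.
rows = [
    ("FB_172", 33.3, 24),
    ("FB_209", 36.4, 22),
    ("FB_001", 41.7, 24),
]

for qid, rate, attempts in rows:
    k = implied_successes(rate, attempts)
    # Recompute the rate from the integer count to confirm it matches the table.
    print(qid, k, f"{100 * k / attempts:.1f}%")
# FB_172 8 33.3%
# FB_209 8 36.4%
# FB_001 10 41.7%
```

Read this way, the hardest question (FB_172) was answered correctly in 8 of its 24 attempts.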
Dataset
239 test cases
70 Half-Step Down
33 Drop Db
91 Standard
45 Drop D
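The four tuning counts sum to 239 test cases, and the overall Score values in the leaderboard sit close to a case-weighted average of the per-tuning percentages. The page does not state how Score is computed, so the following is only a sketch under that assumption. The function name `weighted_score` is hypothetical. Because the published breakdown is rounded to whole percents, a reconstructed score can differ from the listed one by a few tenths.

```python
# Per-tuning case counts, copied from the Dataset section.
CASES = {"Std": 91, "DropD": 45, "HSD": 70, "DropDb": 33}


def weighted_score(breakdown: dict) -> float:
    """Case-weighted mean of per-tuning accuracy (an assumed definition of Score)."""
    total = sum(CASES.values())  # 239 test cases
    return sum(breakdown[t] * CASES[t] for t in CASES) / total


# Per-tuning breakdown for Claude Opus 4.7, copied from the leaderboard.
opus_47 = {"Std": 87, "DropD": 88, "HSD": 89, "DropDb": 83}
print(round(weighted_score(opus_47), 1))  # 87.2, matching the listed Score
```

Because Standard accounts for 91 of the 239 cases, a model's Standard accuracy moves this average more than any other tuning, and Drop Db (33 cases) moves it least.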