Benchmark Results
Leaderboard
| # | Model | OW | Provider | Score | Standard | Drop D | Half-Step Down | Drop Db | Cost | Last Tested |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen 3.5 Flash | OW | Alibaba | 94.5% | 97% | 97% | 93% | 89% | $0.166 | Mar 9, 2026 |
| 2 | DeepSeek V3.2 Speciale | OW | DeepSeek | 94.5% | 98% | 97% | 93% | 86% | $0.286 | Mar 10, 2026 |
| 3 | Qwen 3.5 Plus | OW | Alibaba | 94.5% | 98% | 97% | 93% | 86% | $0.378 | Mar 9, 2026 |
| 4 | Kimi K2.5 (Reasoning) | OW | Moonshot | 94.5% | 98% | 97% | 93% | 86% | $0.596 | Mar 10, 2026 |
| 5 | DeepSeek V3.2 Speciale (Reasoning) | OW | DeepSeek | 93.4% | 97% | 97% | 93% | 82% | $0.297 | Mar 10, 2026 |
| 6 | Kimi K2.5 | OW | Moonshot | 92.9% | 98% | 97% | 88% | 86% | $0.614 | Mar 10, 2026 |
| 7 | MiniMax M2.5 | OW | MiniMax | 83.0% | 92% | 75% | 80% | 75% | $0.113 | Mar 9, 2026 |
| 8 | MiniMax M2.5 (Reasoning) | OW | MiniMax | 80.2% | 85% | 75% | 82% | 71% | $0.110 | Mar 9, 2026 |
| 9 | GPT-5.4 | | OpenAI | 74.2% | 89% | 81% | 59% | 61% | $0.242 | Mar 9, 2026 |
| 10 | Claude Opus 4.6 | | Anthropic | 68.7% | 82% | 59% | 59% | 68% | $0.570 | Mar 9, 2026 |
| 11 | Gemini 3.1 Flash Lite | | Google | 45.6% | 76% | 69% | 11% | 18% | $0.024 | Mar 9, 2026 |
| 12 | Gemini 3.1 Pro | | Google | 45.6% | 59% | 56% | 34% | 25% | $0.317 | Mar 9, 2026 |
| 13 | Llama 3.3 70B | OW | Meta | 28.0% | 24% | 13% | 41% | 29% | $0.025 | Mar 9, 2026 |
| 14 | Mistral Large | OW | Mistral | 26.9% | 35% | 34% | 27% | 0% | $0.049 | Mar 9, 2026 |
| 15 | Claude Sonnet 4.6 | | Anthropic | 22.5% | 35% | 16% | 20% | 7% | $0.431 | Mar 9, 2026 |
| 16 | DeepSeek V3.2 | OW | DeepSeek | 20.3% | 21% | 38% | 13% | 14% | $0.021 | Mar 9, 2026 |
| 17 | Claude Haiku 4.5 | | Anthropic | 19.2% | 26% | 31% | 7% | 14% | $0.114 | Mar 9, 2026 |
| 18 | Llama 4 Scout | OW | Meta | 13.7% | 20% | 9% | 13% | 7% | $0.012 | Mar 9, 2026 |
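
The row order in the leaderboard is consistent with sorting by score (highest first) and breaking ties by cost (cheapest first). This is inferred from the table, not a documented rule. A minimal sketch that reproduces ranks 1 to 5 under that assumption, with the rows copied from the table in shuffled order:

```python
# (model, score in percent, cost in USD) copied from the leaderboard, shuffled.
rows = [
    ("Kimi K2.5 (Reasoning)", 94.5, 0.596),
    ("DeepSeek V3.2 Speciale (Reasoning)", 93.4, 0.297),
    ("Qwen 3.5 Flash", 94.5, 0.166),
    ("Qwen 3.5 Plus", 94.5, 0.378),
    ("DeepSeek V3.2 Speciale", 94.5, 0.286),
]

# Assumed ranking rule: score descending, then cost ascending as tie-break.
ranked = sorted(rows, key=lambda r: (-r[1], r[2]))

for rank, (model, score, cost) in enumerate(ranked, start=1):
    print(f"{rank}. {model}: {score}% at ${cost:.3f}")
```
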
Tuning Difficulty
Chart (tunings in the order shown): Drop Db, Half-Step Down, Drop D, Standard
Hardest Questions
| ID | Tuning | Success Rate | Attempts |
|---|---|---|---|
| FB_047 | Drop D | 5.6% | 18 |
| FB_050 | Drop Db | 5.6% | 18 |
| FB_085 | Half-Step Down | 5.6% | 18 |
| FB_028 | Half-Step Down | 11.1% | 18 |
| FB_052 | Half-Step Down | 11.1% | 18 |
| FB_080 | Drop Db | 11.1% | 18 |
| FB_049 | Half-Step Down | 16.7% | 18 |
| FB_044 | Drop Db | 22.2% | 18 |
| FB_062 | Drop Db | 22.2% | 18 |
| FB_148 | Standard | 33.3% | 18 |
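
Each question shows 18 attempts, which matches the 18 models on the leaderboard, so a success rate can be read as a count of models that answered correctly. The sketch below makes that conversion, assuming one attempt per model; `models_solved` is a hypothetical helper name.

```python
def models_solved(success_rate_pct: float, attempts: int = 18) -> int:
    """Convert a rounded success rate (percent) back to a whole number of successes."""
    return round(success_rate_pct / 100 * attempts)

# A few rows from the Hardest Questions table: (question ID, success rate in percent).
for qid, rate in [("FB_047", 5.6), ("FB_028", 11.1), ("FB_049", 16.7),
                  ("FB_044", 22.2), ("FB_148", 33.3)]:
    print(f"{qid}: {models_solved(rate)} of 18")
```

Under this reading, the three hardest questions (5.6%) were each answered correctly by a single model.
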
Dataset
182 test cases
56 Half-Step Down
28 Drop Db
66 Standard
32 Drop D
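
The overall scores on the leaderboard appear to be the fraction of all 182 test cases answered correctly, which makes each score a case-weighted mean of the per-tuning results. That relationship is inferred from the numbers, not stated by the source. A minimal sketch checking it against the Qwen 3.5 Flash row, where `overall_score` is a hypothetical helper:

```python
# Test-case counts per tuning, from the Dataset section.
CASES = {"Standard": 66, "Drop D": 32, "Half-Step Down": 56, "Drop Db": 28}

def overall_score(breakdown: dict[str, float]) -> float:
    """Case-weighted mean of per-tuning accuracies, in percent."""
    total = sum(CASES.values())
    return sum(breakdown[tuning] * n for tuning, n in CASES.items()) / total

# Per-tuning accuracies for Qwen 3.5 Flash, from the leaderboard.
qwen_flash = {"Standard": 97, "Drop D": 97, "Half-Step Down": 93, "Drop Db": 89}

print(round(overall_score(qwen_flash), 1))  # 94.5, matching the leaderboard
```

Because the per-tuning percentages are rounded to whole numbers, other rows may differ from the published score by a tenth of a point or so.
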