Benchmark Results

Dataset: 577c513b40c4537ed1b65a3acbe49db2d73e4ca1
Eval: v1.0.0
Last updated: Mar 10, 2026

Leaderboard

Tuning Breakdown columns: Std, DropD, HSD, DropDb.

#   Model                               OW  Provider   Score   Std   DropD  HSD   DropDb  Cost    Last Tested
1   Qwen 3.5 Flash                      OW  Alibaba    94.5%   97%   97%    93%   89%     $0.166  Mar 9, 2026
2   DeepSeek V3.2 Speciale              OW  DeepSeek   94.5%   98%   97%    93%   86%     $0.286  Mar 10, 2026
3   Qwen 3.5 Plus                       OW  Alibaba    94.5%   98%   97%    93%   86%     $0.378  Mar 9, 2026
4   Kimi K2.5 (Reasoning)               OW  Moonshot   94.5%   98%   97%    93%   86%     $0.596  Mar 10, 2026
5   DeepSeek V3.2 Speciale (Reasoning)  OW  DeepSeek   93.4%   97%   97%    93%   82%     $0.297  Mar 10, 2026
6   Kimi K2.5                           OW  Moonshot   92.9%   98%   97%    88%   86%     $0.614  Mar 10, 2026
7   MiniMax M2.5                        OW  MiniMax    83.0%   92%   75%    80%   75%     $0.113  Mar 9, 2026
8   MiniMax M2.5 (Reasoning)            OW  MiniMax    80.2%   85%   75%    82%   71%     $0.110  Mar 9, 2026
9   GPT-5.4                                 OpenAI     74.2%   89%   81%    59%   61%     $0.242  Mar 9, 2026
10  Claude Opus 4.6                         Anthropic  68.7%   82%   59%    59%   68%     $0.570  Mar 9, 2026
11  Gemini 3.1 Flash Lite                   Google     45.6%   76%   69%    11%   18%     $0.024  Mar 9, 2026
12  Gemini 3.1 Pro                          Google     45.6%   59%   56%    34%   25%     $0.317  Mar 9, 2026
13  Llama 3.3 70B                       OW  Meta       28.0%   24%   13%    41%   29%     $0.025  Mar 9, 2026
14  Mistral Large                       OW  Mistral    26.9%   35%   34%    27%   0%      $0.049  Mar 9, 2026
15  Claude Sonnet 4.6                       Anthropic  22.5%   35%   16%    20%   7%      $0.431  Mar 9, 2026
16  DeepSeek V3.2                       OW  DeepSeek   20.3%   21%   38%    13%   14%     $0.021  Mar 9, 2026
17  Claude Haiku 4.5                        Anthropic  19.2%   26%   31%    7%    14%     $0.114  Mar 9, 2026
18  Llama 4 Scout                       OW  Meta       13.7%   20%   9%     13%   7%      $0.012  Mar 9, 2026
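
The page does not state how ties are broken, but the listed order is consistent with sorting by score (descending) and then by cost (ascending): the four models at 94.5% and the two at 45.6% all appear cheapest first. A minimal Python sketch of that ordering follows, assuming this tie-break rule. The `rows` list and the `rank` helper are illustrative names, not part of the benchmark.

```python
# (model, score_percent, cost_usd): a few rows copied from the leaderboard,
# deliberately shuffled so the sort has work to do.
rows = [
    ("Kimi K2.5 (Reasoning)", 94.5, 0.596),
    ("Gemini 3.1 Pro", 45.6, 0.317),
    ("Qwen 3.5 Plus", 94.5, 0.378),
    ("Gemini 3.1 Flash Lite", 45.6, 0.024),
    ("Qwen 3.5 Flash", 94.5, 0.166),
    ("DeepSeek V3.2 Speciale", 94.5, 0.286),
]


def rank(rows):
    """Sort by score descending, then by cost ascending (assumed tie-break)."""
    return sorted(rows, key=lambda r: (-r[1], r[2]))


ranked = [name for name, _, _ in rank(rows)]
for position, name in enumerate(ranked, start=1):
    print(position, name)
```

With these six rows, the sketch reproduces the relative order shown on the leaderboard.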

Tuning Difficulty

[Chart: difficulty by tuning. Categories as listed: Drop Db, Half-Step Down, Drop D, Standard. Bar values not preserved.]

Hardest Questions

ID      Tuning          Success Rate  Attempts
FB_047  Drop D          5.6%          18
FB_050  Drop Db         5.6%          18
FB_085  Half-Step Down  5.6%          18
FB_028  Half-Step Down  11.1%         18
FB_052  Half-Step Down  11.1%         18
FB_080  Drop Db         11.1%         18
FB_049  Half-Step Down  16.7%         18
FB_044  Drop Db         22.2%         18
FB_062  Drop Db         22.2%         18
FB_148  Standard        33.3%         18
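
Each question shows 18 attempts, which matches the number of models on the leaderboard, so each success rate likely corresponds to a whole number of models that answered correctly. The Python sketch below recovers those counts from the rounded percentages. The one-attempt-per-model reading is an assumption, and `solved_count` is a hypothetical helper.

```python
def solved_count(success_rate_percent: float, attempts: int) -> int:
    """Recover the number of successful attempts from a rounded percentage."""
    return round(success_rate_percent / 100 * attempts)


# One representative question per distinct success rate in the table.
rates = {
    "FB_047": 5.6,
    "FB_028": 11.1,
    "FB_049": 16.7,
    "FB_044": 22.2,
    "FB_148": 33.3,
}

counts = {qid: solved_count(rate, 18) for qid, rate in rates.items()}
for qid, solved in counts.items():
    print(f"{qid}: {solved} of 18")
```

Under that assumption, the three hardest questions (5.6%) were each solved by a single attempt out of 18.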

Dataset

182 test cases
56 Half-Step Down
28 Drop Db
66 Standard
32 Drop D
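
The overall Score appears to be the fraction of all 182 test cases answered correctly, so it can be approximately rebuilt from the per-tuning breakdown and the case counts above. The Python sketch below does that. It assumes Std, DropD, HSD, and DropDb abbreviate Standard, Drop D, Half-Step Down, and Drop Db, and that the breakdown percentages are rounded from whole-case counts. `CASES` and `overall_score` are illustrative names.

```python
# Test cases per tuning, taken from the dataset summary.
CASES = {"Std": 66, "DropD": 32, "HSD": 56, "DropDb": 28}


def overall_score(breakdown: dict) -> float:
    """Rebuild the overall score (percent) from per-tuning percentages."""
    total = sum(CASES.values())  # 182
    # Convert each rounded percentage back to a whole number of correct cases.
    correct = sum(round(breakdown[t] / 100 * n) for t, n in CASES.items())
    return round(100 * correct / total, 1)


qwen_flash = {"Std": 97, "DropD": 97, "HSD": 93, "DropDb": 89}
gemini_pro = {"Std": 59, "DropD": 56, "HSD": 34, "DropDb": 25}

print(overall_score(qwen_flash))  # 94.5
print(overall_score(gemini_pro))  # 45.6
```

Both values match the leaderboard. Because the tunings have different case counts, the overall score is a case-weighted average, not a plain mean of the four breakdown columns.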