Benchmark Results

Eval v1.0.0, last updated Apr 19, 2026

Score vs Cost

Leaderboard

                                                                Tuning Breakdown
 #  Model                                    Provider   Score   Std   DropD  HSD   DropDb  Cost    Last Tested
 1  Gemini 3.1 Pro                           Google     100.0%  100%  100%   100%  100%    $1.296  Apr 11, 2026
 2  DeepSeek V3.2 Speciale [OW]              DeepSeek   99.6%   99%   100%   100%  100%    $0.423  Mar 10, 2026
 3  DeepSeek V3.2 Speciale (Reasoning) [OW]  DeepSeek   99.6%   99%   100%   100%  100%    $0.437  Mar 10, 2026
 4  Qwen 3.5 Plus [OW]                       Alibaba    99.6%   99%   100%   100%  100%    $0.605  Mar 9, 2026
 5  Kimi K2.5 (Reasoning) [OW]               Moonshot   99.6%   99%   100%   100%  100%    $0.625  Mar 10, 2026
 6  Kimi K2.5 [OW]                           Moonshot   99.6%   99%   100%   100%  100%    $0.649  Mar 10, 2026
 7  Qwen 3.5 Flash [OW]                      Alibaba    98.3%   99%   100%   99%   94%     $0.231  Mar 9, 2026
 8  Claude Opus 4.7                          Anthropic  87.2%   87%   88%    89%   83%     $2.378  Apr 19, 2026
 9  MiniMax M2.5 [OW]                        MiniMax    84.9%   85%   80%    87%   88%     $0.157  Mar 9, 2026
10  MiniMax M2.5 (Reasoning) [OW]            MiniMax    82.8%   85%   78%    81%   88%     $0.159  Mar 9, 2026
11  GPT-5.4                                  OpenAI     63.2%   74%   69%    49%   58%     $0.319  Mar 9, 2026
12  Claude Opus 4.6                          Anthropic  59.8%   69%   51%    54%   58%     $0.747  Mar 9, 2026
13  Gemma 4 31B [OW]                         Google     51.7%   74%   58%    43%   28%     $0.034  Apr 19, 2026
14  Gemini 3.1 Flash Lite                    Google     38.5%   65%   53%    10%   6%      $0.031  Mar 9, 2026
15  Gemma 4 26B A4B [OW]                     Google     36.2%   59%   52%    19%   5%      $0.025  Apr 19, 2026
16  Llama 3.3 70B [OW]                       Meta       24.7%   20%   13%    39%   24%     $0.031  Mar 9, 2026
17  Mistral Large [OW]                       Mistral    24.7%   30%   31%    24%   3%      $0.053  Mar 9, 2026
18  Claude Sonnet 4.6                        Anthropic  20.5%   30%   20%    16%   6%      $0.590  Mar 9, 2026
19  DeepSeek V3.2 [OW]                       DeepSeek   20.1%   21%   33%    16%   9%      $0.029  Mar 9, 2026
20  Claude Haiku 4.5                         Anthropic  17.6%   21%   27%    10%   12%     $0.168  Mar 9, 2026
21  Llama 4 Scout [OW]                       Meta       13.8%   19%   13%    11%   6%      $0.016  Mar 9, 2026
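
One way to read score against cost on the leaderboard is to keep only the models that no other model beats on both counts, that is, models for which nothing cheaper scores at least as high. The Python sketch below applies that filter to the leaderboard rows. The function name `pareto_frontier` and the tuple layout are illustrative choices, not part of the benchmark. Model names are written without the OW tag, and cost is taken as the dollar figure listed, whatever run it covers.

```python
# (model, score in %, cost in USD) copied from the leaderboard
LEADERBOARD = [
    ("Gemini 3.1 Pro", 100.0, 1.296),
    ("DeepSeek V3.2 Speciale", 99.6, 0.423),
    ("DeepSeek V3.2 Speciale (Reasoning)", 99.6, 0.437),
    ("Qwen 3.5 Plus", 99.6, 0.605),
    ("Kimi K2.5 (Reasoning)", 99.6, 0.625),
    ("Kimi K2.5", 99.6, 0.649),
    ("Qwen 3.5 Flash", 98.3, 0.231),
    ("Claude Opus 4.7", 87.2, 2.378),
    ("MiniMax M2.5", 84.9, 0.157),
    ("MiniMax M2.5 (Reasoning)", 82.8, 0.159),
    ("GPT-5.4", 63.2, 0.319),
    ("Claude Opus 4.6", 59.8, 0.747),
    ("Gemma 4 31B", 51.7, 0.034),
    ("Gemini 3.1 Flash Lite", 38.5, 0.031),
    ("Gemma 4 26B A4B", 36.2, 0.025),
    ("Llama 3.3 70B", 24.7, 0.031),
    ("Mistral Large", 24.7, 0.053),
    ("Claude Sonnet 4.6", 20.5, 0.590),
    ("DeepSeek V3.2", 20.1, 0.029),
    ("Claude Haiku 4.5", 17.6, 0.168),
    ("Llama 4 Scout", 13.8, 0.016),
]


def pareto_frontier(rows):
    """Return the rows that no cheaper-or-equal row matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, higher score first.
    for name, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        # Keep a row only if it scores strictly higher than everything cheaper.
        if score > best_score:
            frontier.append((name, score, cost))
            best_score = score
    return frontier


if __name__ == "__main__":
    for name, score, cost in pareto_frontier(LEADERBOARD):
        print(f"{name:<36} {score:5.1f}%  ${cost:.3f}")
```

On these rows the filter keeps eight models, from Llama 4 Scout at the cheap end to Gemini 3.1 Pro at the top. Where several models tie on score, only the cheapest survives.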

Tuning Difficulty

Drop Db
Half-Step Down
Drop D
Standard

Hardest Questions

ID      Tuning          Success Rate  Attempts
FB_172  Drop D          33.3%         24
FB_209  Standard        36.4%         22
FB_148  Standard        37.5%         24
FB_169  Standard        37.5%         24
FB_202  Standard        40.9%         22
FB_204  Half-Step Down  40.9%         22
FB_225  Drop D          40.9%         22
FB_226  Standard        40.9%         22
FB_228  Standard        40.9%         22
FB_001  Half-Step Down  41.7%         24
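
The success rates above are rounded percentages. Assuming each rate is simply successes divided by attempts (the page does not state this), the underlying counts can be recovered as in the Python sketch below. The helper name `implied_successes` is hypothetical.

```python
# (question ID, tuning, success rate in %, attempts) copied from the table above
HARDEST = [
    ("FB_172", "Drop D", 33.3, 24),
    ("FB_209", "Standard", 36.4, 22),
    ("FB_148", "Standard", 37.5, 24),
    ("FB_169", "Standard", 37.5, 24),
    ("FB_202", "Standard", 40.9, 22),
    ("FB_204", "Half-Step Down", 40.9, 22),
    ("FB_225", "Drop D", 40.9, 22),
    ("FB_226", "Standard", 40.9, 22),
    ("FB_228", "Standard", 40.9, 22),
    ("FB_001", "Half-Step Down", 41.7, 24),
]


def implied_successes(rate_pct, attempts):
    """Nearest whole number of successes consistent with the rounded rate.

    Assumes rate = successes / attempts, which the source does not confirm.
    """
    return round(rate_pct * attempts / 100)


if __name__ == "__main__":
    for qid, tuning, rate, attempts in HARDEST:
        k = implied_successes(rate, attempts)
        print(f"{qid}  {tuning:<15} {k}/{attempts} ({100 * k / attempts:.1f}%)")
```

Under that assumption every listed rate matches a whole-number count once rounded to one decimal, which suggests the hardest question, FB_172, was answered correctly in 8 of 24 attempts.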

Dataset

239 test cases
70 Half-Step Down
33 Drop Db
91 Standard
45 Drop D
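
The four tuning counts add up to the 239 test cases, so they can serve as weights when relating a model's per-tuning breakdown to its overall score. The Python sketch below assumes the overall score is the share of all test cases answered correctly, and that the leaderboard abbreviations map as Std = Standard, DropD = Drop D, HSD = Half-Step Down, DropDb = Drop Db. Neither is stated on the page. Because the breakdown percentages are rounded to whole numbers, the result should only be expected to land within about half a point of the listed score.

```python
# Test-case counts per tuning, from the Dataset section.
CASES = {"Std": 91, "DropD": 45, "HSD": 70, "DropDb": 33}

# A few leaderboard rows: per-tuning scores in % and the listed overall score.
EXAMPLES = {
    "Claude Opus 4.7": ({"Std": 87, "DropD": 88, "HSD": 89, "DropDb": 83}, 87.2),
    "GPT-5.4": ({"Std": 74, "DropD": 69, "HSD": 49, "DropDb": 58}, 63.2),
    "Llama 3.3 70B": ({"Std": 20, "DropD": 13, "HSD": 39, "DropDb": 24}, 24.7),
}


def weighted_score(breakdown):
    """Average the per-tuning scores, weighted by test cases per tuning."""
    total = sum(CASES.values())
    return sum(breakdown[t] * CASES[t] for t in CASES) / total


if __name__ == "__main__":
    for model, (breakdown, listed) in EXAMPLES.items():
        estimate = weighted_score(breakdown)
        print(f"{model:<16} estimated {estimate:.1f}%  listed {listed:.1f}%")
```

An unweighted mean of the four percentages would treat Drop Db (33 cases) the same as Standard (91 cases), so it can drift further from the listed score for models whose results vary a lot between tunings.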