Test Run #9 Analysis
Comparing model performance for the GPQA 2026 benchmark.
Overall Avg. Score: 0.515
Best Model: Claude 4 Sonnet
Highest Model Score: 0.515