Test Run #3 Analysis
Comparing model performance for the GPQA 2026 benchmark.
Overall Avg. Score: 0.583
Best Model: Gemini 2.5 Pro
Highest Model Score: 0.583