Test Run #6 Analysis
Comparing model performance for the GPQA 2026 benchmark.
Overall Avg. Score: 0.534
Best Model: Llama 4
Highest Model Score: 0.534