Test Run #7 Analysis

Comparing model performance for the GPQA 2026 benchmark.

Overall Avg. Score: 0.556
Best Model: Claude 4 Sonnet
Highest Model Score: 0.609

Model Scores per Language

© 2025 LLM Benchmarker. All rights reserved.