Test Run #3 Analysis

Comparing model performance for the GPQA 2026 benchmark.

Overall Avg. Score: 0.583

Best Model: Gemini 2.5 Pro

Highest Model Score: 0.583
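
The overall average and the highest model score above are identical, which is what one would expect if the run contains a single model's result. The sketch below shows how such summary figures could be derived, assuming the overall average is an unweighted mean of per-model scores. That aggregation rule, the `scores` mapping, and the `summarize` helper are illustrative assumptions, not the dashboard's actual implementation. The only data used is the one model and score reported here.

```python
from statistics import mean

# Assumed input shape: model name -> score. Only this one pair
# appears in the report; no other models or scores are known.
scores = {"Gemini 2.5 Pro": 0.583}


def summarize(scores):
    """Derive the three summary figures from per-model scores.

    Assumes the overall average is an unweighted mean across models.
    """
    best_model = max(scores, key=scores.get)
    return {
        "overall_avg": round(mean(scores.values()), 3),
        "best_model": best_model,
        "highest_score": scores[best_model],
    }


summary = summarize(scores)
print(summary)
```

With a single entry, the mean and the maximum necessarily coincide, so matching values for the average and the top score do not by themselves say anything about how other models would compare.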

Model Scores per Language

© 2025 LLM Benchmarker. All rights reserved.