Test Run #9 Analysis
Comparing model performance on the SYSTEMS Benchmark.
Overall Avg. Score: 0.555
Best Model: Claude 4 Sonnet
Highest Model Score: 0.570