Test Run #3 Analysis
Comparing model performance on the E-COMMERCE Benchmark.
Overall Avg. Score: 0.543
Best Model: Llama 4
Highest Model Score: 0.579