Benchmarks
GPQA 2026
Official General-Purpose Question Answering benchmark for 2026.
NICHES Benchmark
Accusantium distinctio dedecor vis videlicet caelum mollitia culpa.
SYNERGIES Benchmark
Cuppedia vir cruciamentum tenax textus sint defendo compello.
CONTENT Benchmark
Capillus aurum cras.
E-COMMERCE Benchmark
Cunae stips utor curtus antea auditor sed.
Recent Test Runs
Test Run #1 - CONTENT Benchmark
Models: Gemini 2.5 Pro, Claude 4 Sonnet | Languages: fr, es, en, it, de
RUNNING | 10 days ago
Test Run #2 - NICHES Benchmark
Models: Claude 4 Sonnet | Languages: de, it, en, es, fr
COMPLETED | 21 days ago
Test Run #3 - GPQA 2026
Models: Claude 4 Sonnet | Languages: de, es, fr, it, en
RUNNING | 29 days ago
Test Run #4 - CONTENT Benchmark
Models: Gemini 2.5 Pro, GPT O3, Claude 4 Sonnet | Languages: en, es, it, fr, de
FAILED | 15 days ago
Test Run #5 - E-COMMERCE Benchmark
Models: Claude 4 Sonnet, Llama 4 | Languages: it, en, fr, es, de
FAILED | yesterday
Test Run #6 - GPQA 2026
Models: Claude 4 Sonnet | Languages: de, es, en, it, fr
COMPLETED | 5 days ago
Test Run #7 - NICHES Benchmark
Models: Llama 4, GPT O3, Claude 4 Sonnet | Languages: de, it, fr, en, es
RUNNING | 29 days ago
Test Run #8 - CONTENT Benchmark
Models: Llama 4, Gemini 2.5 Pro, GPT O3 | Languages: en, fr, es, it
FAILED | 12 days ago
Test Run #9 - CONTENT Benchmark
Models: Gemini 2.5 Pro, GPT O3 | Languages: es, it, en, fr, de
FAILED | 3 days ago
Test Run #10 - GPQA 2026
Models: Gemini 2.5 Pro | Languages: de, es
FAILED | 10 days ago