Benchmarks
GPQA 2026
Official Graduate-Level Google-Proof Q&A (GPQA) benchmark for 2026.
FUNCTIONALITIES Benchmark
Textilis tamdiu coaegresco angelus pauci curso totam virga.
CONTENT Benchmark
Virga toties vallum arto acsi perspiciatis decumbo delibero.
FUNCTIONALITIES Benchmark
Antea cornu surculus coniuratio accusator vado sumo deludo viscus.
INTERFACES Benchmark
Amita aggredior paens annus coadunatio nam.
Recent Test Runs
Test Run #1 - GPQA 2026
Models: Gemini 2.5 Pro, Claude 4 Sonnet, Llama 4 | Languages: en, de, it, fr
RUNNING | 17 days ago
Test Run #2 - FUNCTIONALITIES Benchmark
Models: GPT O3, Gemini 2.5 Pro | Languages: en, fr, de
RUNNING | 6 days ago
Test Run #3 - INTERFACES Benchmark
Models: GPT O3, Claude 4 Sonnet, Gemini 2.5 Pro | Languages: fr, en, es, it, de
FAILED | 17 days ago
Test Run #4 - CONTENT Benchmark
Models: GPT O3, Llama 4 | Languages: en, fr
FAILED | 20 days ago
Test Run #5 - FUNCTIONALITIES Benchmark
Models: Gemini 2.5 Pro, Llama 4, Claude 4 Sonnet, GPT O3 | Languages: it, es, de, fr, en
FAILED | 29 days ago
Test Run #6 - INTERFACES Benchmark
Models: GPT O3, Llama 4, Claude 4 Sonnet, Gemini 2.5 Pro | Languages: it, en, de
RUNNING | 2 days ago
Test Run #7 - INTERFACES Benchmark
Models: Gemini 2.5 Pro | Languages: it, en, fr
COMPLETED | 24 days ago
Test Run #8 - FUNCTIONALITIES Benchmark
Models: Gemini 2.5 Pro, Claude 4 Sonnet | Languages: it, es, en, de
COMPLETED | 28 days ago
Test Run #9 - FUNCTIONALITIES Benchmark
Models: GPT O3, Claude 4 Sonnet, Gemini 2.5 Pro | Languages: de, it, en, fr, es
COMPLETED | 26 days ago
Test Run #10 - FUNCTIONALITIES Benchmark
Models: Claude 4 Sonnet, Llama 4, Gemini 2.5 Pro, GPT O3 | Languages: de, en, fr, it, es
FAILED | 27 days ago