Benchmarks
GPQA 2026
Official General-Purpose Question Answering benchmark for 2026.
INTERFACES Benchmark
Corrupti speciosus voluptas vigor rem commodi cornu vis cubitum thymbra.
APPLICATIONS Benchmark
Baiulus claudeo surgo acer casso animi versus dedico umbra.
LARGE LANGUAGE MODELS Benchmark
Surculus cetera circumvenio bene aspernatur somnus natus.
CONVERGENCE Benchmark
Cras cavus ipsum theologus commodo claro temptatio.
Recent Test Runs
Test Run #1 - GPQA 2026
Models: GPT O3, Claude 4 Sonnet, Llama 4 | Languages: it, fr, en
FAILED | 23 days ago
Test Run #2 - APPLICATIONS Benchmark
Models: Llama 4 | Languages: es, en, de, fr, it
COMPLETED | 5 days ago
Test Run #3 - CONVERGENCE Benchmark
Models: GPT O3, Claude 4 Sonnet | Languages: fr, it
RUNNING | 4 days ago
Test Run #4 - GPQA 2026
Models: Claude 4 Sonnet, Gemini 2.5 Pro, GPT O3, Llama 4 | Languages: fr, en, es, de
FAILED | 27 days ago
Test Run #5 - GPQA 2026
Models: Llama 4, Claude 4 Sonnet | Languages: de, es, it, en
RUNNING | 12 days ago
Test Run #6 - INTERFACES Benchmark
Models: Llama 4, Gemini 2.5 Pro, GPT O3 | Languages: de, it, en, fr
FAILED | 10 days ago
Test Run #7 - GPQA 2026
Models: Claude 4 Sonnet, Gemini 2.5 Pro, Llama 4 | Languages: es, fr
COMPLETED | 13 days ago
Test Run #8 - LARGE LANGUAGE MODELS Benchmark
Models: Claude 4 Sonnet, Gemini 2.5 Pro, GPT O3 | Languages: es, en, de, it, fr
RUNNING | 18 days ago
Test Run #9 - APPLICATIONS Benchmark
Models: Claude 4 Sonnet, Gemini 2.5 Pro | Languages: fr, de, it
RUNNING | 28 days ago
Test Run #10 - INTERFACES Benchmark
Models: Claude 4 Sonnet, Gemini 2.5 Pro, Llama 4 | Languages: de, it
RUNNING | 25 days ago