Benchmarks
GPQA 2026
Official General-Purpose Question Answering benchmark for 2026.
INITIATIVES Benchmark
Vilitas super utrimque conatus voluptatem.
MODELS Benchmark
Bis tandem nesciunt consuasor dolor adaugeo veritatis usque reprehenderit textor.
MODELS Benchmark
Bos vulgo contra decens molestias omnis appositus adhaero.
SMART CONTRACTS Benchmark
Curvo ascisco torrens tantillus casso antiquus utroque arguo.
Recent Test Runs
Test Run #1 - MODELS Benchmark
Models: Llama 4, Claude 4 Sonnet, GPT O3 | Languages: it, de, fr, en, es
FAILED | 24 days ago
Test Run #2 - SMART CONTRACTS Benchmark
Models: Claude 4 Sonnet | Languages: fr, it
RUNNING | 6 days ago
Test Run #3 - MODELS Benchmark
Models: Gemini 2.5 Pro, Claude 4 Sonnet, Llama 4, GPT O3 | Languages: de, en, fr, it
COMPLETED | 17 days ago
Test Run #4 - SMART CONTRACTS Benchmark
Models: Llama 4 | Languages: fr, es
RUNNING | 22 days ago
Test Run #5 - MODELS Benchmark
Models: Llama 4 | Languages: it, fr
RUNNING | 10 days ago
Test Run #6 - SMART CONTRACTS Benchmark
Models: Llama 4, Claude 4 Sonnet | Languages: es, en, de, fr, it
FAILED | 23 days ago
Test Run #7 - MODELS Benchmark
Models: GPT O3, Llama 4, Gemini 2.5 Pro, Claude 4 Sonnet | Languages: de, en, it, fr
COMPLETED | 19 days ago
Test Run #8 - MODELS Benchmark
Models: Gemini 2.5 Pro, Llama 4 | Languages: de, fr, it, en, es
FAILED | 3 days ago
Test Run #9 - MODELS Benchmark
Models: Llama 4 | Languages: it, en, fr
FAILED | 11 days ago
Test Run #10 - INITIATIVES Benchmark
Models: Llama 4, Claude 4 Sonnet, Gemini 2.5 Pro | Languages: it, de, en
RUNNING | 12 days ago