Benchmarks

Benchmark or Breakthrough: GPT-5.4 and the Ramsey Hypergraph Question
March 24, 2026
GPT-5.4 Pro improved a constant on a Ramsey bound in Epoch's FrontierMath benchmark. Here is what that actually means, and why the answer requires nuance.
Open-Weight vs Frontier: How Close Is the Accuracy Gap Really?
March 22, 2026
Benchmark scores for open-weight models have converged with those of frontier cloud models on many tasks. But benchmarks measure what benchmarks measure. Here is what the data actually says about where the gap is real and where it has closed.