
Two New Benchmarks Push AI Evaluation Into Research-Level Math and 3D Geometric Reasoning
Introduction
For years, the math benchmarks used to measure AI progress held up. Then they did not. OpenAI's o1 model scored 94.8% on the MATH dataset, a collection of competition-level problems that was expected to challenge AI systems...