COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models
James Meaden, Michał Jarosz, Piotr Jodłowski, Grigori Melnik

TL;DR
COMPASS is a comprehensive benchmark for code generation models that evaluates correctness, efficiency, and quality, revealing that high correctness does not imply optimal efficiency or maintainability.
Contribution
Introduces COMPASS, a multi-dimensional evaluation framework for code generation that incorporates real-world metrics like efficiency and code quality, beyond correctness.
Findings
Models that score highly on correctness often fall short on algorithmic efficiency.
Efficiency and quality are not guaranteed by correctness alone.
COMPASS provides a more holistic assessment of code generation models.
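The gap between correctness and efficiency described in these findings can be illustrated with a small sketch. It is not the COMPASS harness and not a Codility task: the problem (maximum slice sum), the function names, the test cases, and the step counter used as a stand-in for runtime are all assumptions made for illustration. Both solutions pass the same correctness tests, so a correctness-only benchmark would score them identically, yet their work grows at very different rates.

```python
# Illustrative sketch only: a hypothetical task and a hypothetical "steps"
# metric, not the problems or tooling used by COMPASS.
from typing import Callable, List, Tuple

Solution = Callable[[List[int]], Tuple[int, int]]


def max_slice_quadratic(values: List[int]) -> Tuple[int, int]:
    """Return (max non-empty slice sum, additions performed) in O(n^2)."""
    best = values[0]
    steps = 0
    for start in range(len(values)):
        running = 0
        for end in range(start, len(values)):
            running += values[end]
            steps += 1
            best = max(best, running)
    return best, steps


def max_slice_linear(values: List[int]) -> Tuple[int, int]:
    """Return (max non-empty slice sum, additions performed) in O(n)."""
    best = current = values[0]
    steps = 0
    for value in values[1:]:
        # Kadane's algorithm: extend the current slice or start a new one.
        current = max(value, current + value)
        steps += 1
        best = max(best, current)
    return best, steps


def passes_tests(solution: Solution, cases) -> bool:
    """Correctness-only check: ignores how much work the solution did."""
    return all(solution(list(inp))[0] == expected for inp, expected in cases)


# Hypothetical test cases: (input, expected maximum slice sum).
CASES = [([3, -2, 5, -1], 6), ([-4, -1, -7], -1), ([2], 2)]

if __name__ == "__main__":
    data = [(-1) ** i * (i % 7) for i in range(1000)]
    for solution in (max_slice_quadratic, max_slice_linear):
        result, steps = solution(data)
        print(solution.__name__, passes_tests(solution, CASES), result, steps)
```

On a 1000-element input the quadratic version performs n(n+1)/2 additions and the linear version n-1, even though both are marked correct. A multi-dimensional evaluation would record that difference instead of discarding it.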
Abstract
Current code generation benchmarks focus primarily on functional correctness while overlooking two critical aspects of real-world programming: algorithmic efficiency and code quality. We introduce COMPASS (COdility's Multi-dimensional Programming ASSessment), a comprehensive evaluation framework that assesses code generation across three dimensions: correctness, efficiency, and quality. COMPASS consists of 50 competitive programming problems from real Codility competitions, providing authentic human baselines from 393,150 submissions. Unlike existing benchmarks that treat algorithmically inefficient solutions identically to optimal ones provided they pass test cases, COMPASS systematically evaluates runtime efficiency and code quality using industry-standard analysis tools. Our evaluation of three leading reasoning-enhanced models, Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and…
