COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models

James Meaden; Michał Jarosz; Piotr Jodłowski; Grigori Melnik

arXiv:2508.13757·cs.SE·August 20, 2025

TL;DR

COMPASS is a comprehensive benchmark for code generation models that evaluates correctness, efficiency, and quality, revealing that high correctness does not imply optimal efficiency or maintainability.

Contribution

Introduces COMPASS, a multi-dimensional evaluation framework for code generation that scores algorithmic efficiency and code quality in addition to functional correctness.

Findings

01

Models with high correctness scores often lack efficiency.

02

Efficiency and quality are not guaranteed by correctness alone.

03

COMPASS provides a more holistic assessment of code generation models.
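
Findings 01 and 02 can be illustrated with a small sketch. The code below is not the COMPASS harness and does not use its problems or metrics; the task (maximum slice sum), the test cases, the `evaluate` helper, and the step counter used as a stand-in for runtime are all assumptions chosen for illustration. Code quality, the third dimension, is left out. Two solutions pass the same tests, yet one does quadratic work and the other linear work.

```python
from typing import Callable, Dict, List, Tuple

Solution = Callable[[List[int]], Tuple[int, int]]


def max_slice_brute(a: List[int]) -> Tuple[int, int]:
    """Best sum of a non-empty contiguous slice, plus loop steps taken. O(n^2)."""
    best, steps = a[0], 0
    for i in range(len(a)):
        total = 0
        for j in range(i, len(a)):
            total += a[j]
            best = max(best, total)
            steps += 1
    return best, steps


def max_slice_kadane(a: List[int]) -> Tuple[int, int]:
    """Same answer via Kadane's algorithm, plus loop steps taken. O(n)."""
    best = current = a[0]
    steps = 1
    for x in a[1:]:
        current = max(x, current + x)
        best = max(best, current)
        steps += 1
    return best, steps


# Hypothetical functional tests: (input, expected maximum slice sum).
TESTS = [([3, -2, 5, -1], 6), ([-4, -1, -7], -1), ([2, 2, 2], 6)]


def evaluate(solution: Solution) -> Dict[str, float]:
    """Report correctness and an efficiency proxy as separate dimensions."""
    passed = sum(solution(list(inp))[0] == expected for inp, expected in TESTS)
    # Step count on a 1000-element input stands in for measured runtime.
    _, steps = solution(list(range(-500, 500)))
    return {"correctness": passed / len(TESTS), "steps_n1000": steps}


brute_report = evaluate(max_slice_brute)
kadane_report = evaluate(max_slice_kadane)
print(brute_report)
print(kadane_report)
```

Both reports show a correctness of 1.0, so a correctness-only benchmark would rank the two solutions equally; only the second field separates them. A deterministic step count is used here instead of wall-clock timing so the comparison does not depend on the machine.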

Abstract

Current code generation benchmarks focus primarily on functional correctness while overlooking two critical aspects of real-world programming: algorithmic efficiency and code quality. We introduce COMPASS (COdility's Multi-dimensional Programming ASSessment), a comprehensive evaluation framework that assesses code generation across three dimensions: correctness, efficiency, and quality. COMPASS consists of 50 competitive programming problems from real Codility competitions, providing authentic human baselines from 393,150 submissions. Unlike existing benchmarks that treat algorithmically inefficient solutions identically to optimal ones provided they pass test cases, COMPASS systematically evaluates runtime efficiency and code quality using industry-standard analysis tools. Our evaluation of three leading reasoning-enhanced models, Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and…
