CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs

Dung Nguyen Manh; Thang Phan Chau; Nam Le Hai; Thong T. Doan; Nam V. Nguyen; Quang Pham; Nghi D. Q. Bui

arXiv:2410.01999 · cs.SE · April 10, 2025

TL;DR

CodeMMLU is a comprehensive benchmark with nearly 20,000 questions designed to evaluate code understanding and reasoning capabilities of Code Large Language Models across multiple tasks and programming languages.

Contribution

It introduces a new multi-task benchmark focused on assessing deep code comprehension and reasoning, addressing a gap in existing code evaluation methods.

Findings

1. State-of-the-art models perform poorly on CodeMMLU tasks.
2. CodeMMLU reveals significant gaps in code understanding beyond generation.
3. The benchmark covers diverse domains and programming languages.

Abstract

Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a comprehensive multiple-choice benchmark designed to evaluate the depth of software and code comprehension in LLMs. CodeMMLU includes nearly 20,000 questions spanning diverse domains, including code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks that emphasize code generation, CodeMMLU assesses a model's ability to reason about programs across a wide range of tasks such as code repair, execution reasoning, and fill-in-the-blank challenges. Our extensive evaluation reveals that even state-of-the-art models struggle with CodeMMLU, highlighting significant gaps in comprehension…
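
Because the benchmark is multiple-choice, scoring reduces to comparing a model's chosen option letter against a gold letter. The following is a minimal sketch of per-task accuracy under stated assumptions, not the authors' evaluation harness: the `MCQuestion` fields, the task names, and the rule that a response begins with its option letter are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MCQuestion:
    """Hypothetical record layout; the real dataset schema may differ."""
    task: str       # e.g. "code_repair" (illustrative task name)
    question: str
    choices: list   # option texts, in order A, B, C, ...
    answer: str     # gold option letter, e.g. "B"


def normalize(response: str) -> str:
    """Assumes the response starts with its option letter; returns it upper-cased."""
    return response.strip()[:1].upper()


def accuracy_by_task(questions, responses):
    """Return {task: fraction of questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, r in zip(questions, responses):
        total[q.task] += 1
        if normalize(r) == q.answer:
            correct[q.task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy data, invented for this sketch only.
questions = [
    MCQuestion("code_repair", "Which patch fixes the bug?", ["p1", "p2", "p3", "p4"], "B"),
    MCQuestion("code_repair", "Which line is faulty?", ["l1", "l2", "l3", "l4"], "C"),
    MCQuestion("execution_reasoning", "What does f(2) return?", ["1", "2", "3", "4"], "A"),
]
responses = ["B", " c. ", "D"]
scores = accuracy_by_task(questions, responses)
print(scores)
```

Reporting accuracy per task rather than as a single number keeps weaknesses in one area, such as execution reasoning, from being hidden by strength in another.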

Code & Models

Repositories

fsoft-ai4code/codemmlu

Datasets

Fsoft-AIC/CodeMMLU
dataset· 784 dl

Taxonomy

Topics: Software Engineering Research · E-Learning and Knowledge Management · Software System Performance and Reliability