AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian

TL;DR
AutoCodeBench is an automatically constructed, multilingual, high-difficulty code generation benchmark built without manual annotation, designed to evaluate large language models' coding ability across many languages and difficulty levels.
Contribution
The paper presents AutoCodeGen, a novel automated method for creating challenging multilingual code datasets, and introduces AutoCodeBench, a large-scale benchmark with 3,920 problems across 20 languages.
Findings
Most advanced LLMs struggle with complex multilingual tasks.
AutoCodeBench provides a challenging evaluation environment for code generation.
The benchmark covers diverse languages and problem difficulties.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining…
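The test-case step mentioned at the end of the abstract, which the reviews describe as generating test inputs first and then obtaining outputs through sandbox execution, can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the function names `run_reference` and `build_test_cases` are hypothetical, the candidate inputs are passed in as plain strings instead of being produced by an LLM, and a child Python process with a timeout stands in for a real sandbox (it provides no actual isolation).

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


def run_reference(solution_src: str, test_input: str,
                  timeout_s: float = 5.0) -> Optional[str]:
    """Run a reference solution in a child interpreter (a stand-in for a sandbox).

    The test input is fed on stdin. Returns captured stdout, or None if the
    solution crashes or exceeds the timeout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(solution_src)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                input=test_input,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return None
    if proc.returncode != 0:
        return None
    return proc.stdout


def build_test_cases(solution_src: str,
                     candidate_inputs: List[str]) -> List[Tuple[str, str]]:
    """Pair each input with the output the reference solution actually produced.

    Expected outputs come from execution, never from a model's guess.
    Inputs on which the solution fails are discarded.
    """
    cases = []
    for test_input in candidate_inputs:
        output = run_reference(solution_src, test_input)
        if output is not None:
            cases.append((test_input, output))
    return cases
```

For example, calling `build_test_cases` with a solution that reads an integer and prints its double, and the inputs `"3\n"` and `"abc\n"`, keeps only the first input, because the second makes the solution crash. The design point is that an expected output is recorded only when the code has actually run.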
Peer Reviews
Decision: ICLR 2026 Poster
1. The greatest contribution of this paper is the AutoCodeGen framework, which automatically generates benchmark data and attempts to remove the reliance on expensive and time-consuming manual annotation. Its highlight is how it ensures the correctness of the generated data: first, it generates test inputs and then obtains outputs through sandbox execution, avoiding the errors that can arise when LLMs generate test cases directly; second, it generates problems in reverse from answers and test cases to e
The paper states that AutoCodeBench aims to conduct strict evaluations of large language models on diverse, high-difficulty, and realistic multi-language programming tasks. However, to achieve this goal, there are areas where the paper can be improved: 1. Insufficient Difficulty Assessment in the AutoCodeGen Framework: In Section 2.1.4, the authors mention the method for difficulty control: "we employ a moderately capable code model, DeepSeek-Coder-V2-Lite, to filter out too easy problems. Specif
1. The automation of the benchmark is good, which eliminates the need for manual annotations. This is a significant advantage for scaling code generation evaluation across languages and problem complexities. 2. AutoCodeBench contains 20 languages and high-difficulty problems, which makes it a robust benchmark for multilingual code generation tasks. 3. The paper is easy to follow.
1. The authors should pay attention to other low-resource languages. 2. The system relies heavily on LLMs for generating code solutions and test cases, which can introduce biases or errors inherent to the models being used, especially if those models are trained on flawed data. 3. The paper focuses on evaluating the code generation of models but does not address the broader issue of how the benchmark extends to other domains.
1. The authors have conducted a comprehensive evaluation across a very large number of models. This provides a broad and valuable snapshot of the current landscape of code generation capabilities as measured by their proposed benchmark. 2. The strategy of generating test inputs first and then using a sandbox to obtain ground-truth outputs is a clever and effective method for ensuring the correctness of the generated test cases. I am looking forward to the authors' response and would be happy to
The paper's central claim of high difficulty is not sufficiently deconstructed or justified. The methodology for ensuring difficulty involves a post-hoc filtering step that removes any problem solvable by a "moderately capable model" (Line 195). This approach risks conflating genuine, meaningful difficulty with other confounding factors. Specifically, the source of difficulty remains ambiguous:
- Is a problem difficult because it requires complex algorithmic reasoning or deep domain knowledge?
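The post-hoc filtering step discussed in these reviews can be sketched as follows. This is only an illustration under stated assumptions: `filter_too_easy`, the `filter_model_solves` callback, and the `attempts` parameter are hypothetical names, and the rule used here (drop a problem if the filter model solves it in any attempt) is an assumed stand-in, since the exact criterion is not visible in the excerpts above.

```python
from typing import Callable, Iterable, List, TypeVar

P = TypeVar("P")


def filter_too_easy(problems: Iterable[P],
                    filter_model_solves: Callable[[P], bool],
                    attempts: int = 3) -> List[P]:
    """Keep only problems the filter model fails on every attempt.

    `filter_model_solves` is a hypothetical callback that samples one
    solution from the filter model and reports whether it passes the
    problem's tests. The attempt count and the "solved at least once"
    rule are assumptions made for illustration.
    """
    kept = []
    for problem in problems:
        solved = any(filter_model_solves(problem) for _ in range(attempts))
        if not solved:
            kept.append(problem)
    return kept
```

A filter of this shape records only that the filter model failed, not why it failed, which is the ambiguity the review points to: a problem can survive because it is algorithmically hard or for unrelated reasons such as an unclear specification.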
Taxonomy
Topics: Software Engineering Research · Machine Learning in Materials Science · Topic Modeling
