OJBench: A Competition Level Code Benchmark For Large Language Models
Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, Tianyu Liu, Weiran Xu

TL;DR
OJBench is a new, challenging benchmark with 232 competitive programming problems designed to evaluate the code reasoning abilities of large language models at a level comparable to programming competitions.
Contribution
The paper introduces OJBench, a comprehensive benchmark for assessing the competitive-level code reasoning skills of large language models, filling a gap in existing evaluation tools.
Findings
State-of-the-art models still struggle with competition-level problems.
OJBench reveals significant challenges in current LLMs' code reasoning.
Evaluation on 37 models demonstrates the benchmark's rigor.
Abstract
Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models' reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper is easy to follow. It introduces the first fully competition-level code reasoning benchmark (OJBench) derived from real NOI and ICPC problems, addressing a clear gap beyond existing datasets like LiveCodeBench or CodeElo. 2. This paper shows how models can improve with execution-based refinement, providing insight into iterative reasoning and debugging behavior. 3. The authors plan to release dataset, codebase, and evaluation pipeline, ensuring transparency and long-term benchmark
1. The analysis and conclusion are not deep; I did not gain new insights beyond the benchmark construction. 2. The paper mainly focuses on dataset creation and evaluation, which may fit better in a dataset or benchmark track rather than the main ICLR track.
This is a complete paper: the data pipeline is effective, and the experiments are comprehensive. However, the weaknesses are, overall, quite serious.
- In line 047, the authors claim that CodeElo is not standardized and transparent, which are pivotal limitations of current coding benchmarks. In my view, this claim may not hold. A transparent benchmark can exacerbate data contamination, which is unacceptable. Moreover, if the authors argue that existing coding benchmarks may suffer evaluation bias due to the choice of problems, then OJBench would require strict quality control, diversity control, problem selection control, and robust data cont
1. The selection of problems from NOI and ICPC is a clear strength, providing a high-difficulty set of tasks that effectively challenges the current generation of LLMs. 2. The evaluation of 37 different models is comprehensive. It provides a valuable, wide-ranging snapshot of the current landscape, from general-purpose coders to specialized reasoning models. 3. The inclusion of both C++ and Python evaluations is a thoughtful touch that reflects real-world competitive programming. Furthermore,
1. The most glaring omission is the absence of a data contamination analysis. Problems from high-profile competitions like NOI and ICPC are extensively discussed online, with countless solutions, tutorials, and analyses available on platforms like GitHub, blogs, and forums. It is almost certain that this data is present in the training corpora of the models being evaluated. Without a rigorous decontamination study to identify and potentially exclude contaminated problems, the benchmark's results
The paper demonstrates strong originality by introducing the first benchmark explicitly focused on competition-level programming tasks, bridging the gap between existing simple code benchmarks and the high-level reasoning challenges posed by contests like NOI and ICPC. In terms of quality, the dataset construction is rigorous: problems are carefully curated, filtered, translated, and annotated with difficulty levels based on real contest data. The evaluation framework is robust, using execution
Although this paper is very solid, there already exist many similar works. I hope the authors can highlight more clearly how OJBench differs from other competitive programming benchmarks such as LiveCodeBench, LiveCodeBench Pro, ACOBench, and CodeElo, and what specific advantages OJBench provides over them. Since the level of novelty is somewhat limited, I would also encourage the authors to conduct more interesting analytical experiments that can yield unique insights. If OJBench can demonstra
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Materials Science
