Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming

Sama Hadhoud; Alaa Elsetohy; Frederikus Hudi; Jan Christian Blaise Cruz; Steven Halim; Alham Fikri Aji

arXiv:2601.11332·cs.CL·January 19, 2026

TL;DR

This paper argues that evaluations of LLMs on competitive programming should separate problem-solving reasoning from code implementation, and proposes editorial-based evaluation to better diagnose reasoning capabilities.

Contribution

It introduces an editorial-based evaluation framework and a new dataset with gold editorials, and demonstrates the benefits of focusing on reasoning over code execution in benchmarking LLMs.

Findings

01. Generating editorials improves solve rates for some LLMs.

02. Expert editorials significantly boost model performance.

03. Models struggle with implementation even when given gold editorials.

Abstract

Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial prior to code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations and validate an LLM-as-a-judge…

Code & Models

Datasets

samahadhoud/idea-first-code-later-cp
dataset · 49 downloads

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Artificial Intelligence in Healthcare and Education