An Empirical Study of Reasoning Steps in Thinking Code LLMs

Haoran Xue; Gias Uddin; Song Wang

arXiv:2511.05874·cs.AI·November 11, 2025

Open Access

TL;DR

This empirical study examines how thinking large language models reason during code generation, analyzing the structure of their reasoning chains, the effect of step budgets on success, and the ways the chains fail across tasks of varying difficulty.

Contribution

The study systematically evaluates reasoning chains from six state-of-the-art thinking LLMs, introduces a taxonomy of reasoning issues, and examines how task complexity affects reasoning quality.

Findings

01 Targeted increases in reasoning steps can improve success rates for some models and tasks.

02 Modest reductions in reasoning steps often preserve success on standard tasks.

03 Incompleteness is the primary failure mode, especially on hard problems.

Abstract

Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, the quality of these reasoning chains remains underexplored. We present a comprehensive empirical study examining the reasoning process and quality of thinking LLMs for code generation. We evaluate six state-of-the-art reasoning LLMs (DeepSeek-R1, OpenAI-o3-mini, Claude-3.7-Sonnet-Thinking, Gemini-2.0-Flash-Thinking, Gemini-2.5-Flash, and Qwen-QwQ) across 100 code generation tasks of varying difficulty from BigCodeBench. We quantify reasoning-chain structure through step counts and verbosity, conduct controlled step-budget adjustments, and perform a 21-participant human evaluation across three dimensions: efficiency, logical correctness, and completeness. Our step-count…
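The abstract mentions quantifying reasoning-chain structure through step counts and verbosity. As a rough illustration only, the sketch below computes both for a single trace. The segmentation rule (one step per blank-line-separated paragraph), the word-count measure of verbosity, and the names `split_steps` and `chain_stats` are all assumptions made for this example, not the authors' procedure.

```python
import re


def split_steps(trace: str) -> list[str]:
    """Split a reasoning trace into steps.

    Assumption: steps are separated by blank lines. The paper may
    segment traces differently.
    """
    return [s.strip() for s in re.split(r"\n\s*\n", trace) if s.strip()]


def chain_stats(trace: str) -> dict:
    """Return step count and simple word-based verbosity measures."""
    steps = split_steps(trace)
    words_per_step = [len(step.split()) for step in steps]
    total_words = sum(words_per_step)
    return {
        "step_count": len(steps),
        "total_words": total_words,
        # Average words per step; 0.0 for an empty trace.
        "words_per_step": total_words / len(steps) if steps else 0.0,
    }


# Hypothetical three-step trace, used only to exercise the functions.
example = (
    "Parse the input.\n\n"
    "Sort the list by key.\n\n"
    "Return the top three items."
)
stats = chain_stats(example)
print(stats["step_count"], stats["total_words"])
```

A step-budget experiment of the kind the abstract describes could compare these counts before and after adjusting the budget, though how the authors enforce the budget is not stated in this excerpt.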

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Software Engineering Research