An Empirical Study of Reasoning Steps in Thinking Code LLMs
Haoran Xue, Gias Uddin, Song Wang

TL;DR
This empirical study examines how thinking LLMs reason during code generation, analyzing the structure, effectiveness, and failure modes of their reasoning chains across tasks of varying difficulty.
Contribution
The paper systematically evaluates reasoning chains from six state-of-the-art thinking LLMs, introduces a taxonomy of reasoning issues, and examines how task complexity affects reasoning quality.
Findings
Targeted step increases can improve success rates on some models/tasks.
Modest reductions in reasoning steps often preserve success on standard tasks.
Incompleteness is the primary failure mode, especially on hard problems.
Abstract
Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, the quality of these reasoning chains remains underexplored. We present a comprehensive empirical study examining the reasoning process and quality of thinking LLMs for code generation. We evaluate six state-of-the-art reasoning LLMs (DeepSeek-R1, OpenAI-o3-mini, Claude-3.7-Sonnet-Thinking, Gemini-2.0-Flash-Thinking, Gemini-2.5-Flash, and Qwen-QwQ) across 100 code generation tasks of varying difficulty from BigCodeBench. We quantify reasoning-chain structure through step counts and verbosity, conduct controlled step-budget adjustments, and perform a 21-participant human evaluation across three dimensions: efficiency, logical correctness, and completeness. Our step-count…
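To make "step counts and verbosity" concrete, here is a minimal sketch of how such measurements could be computed for a single reasoning trace. It is not the authors' pipeline: the blank-line step delimiter, the word-based verbosity measure, and the names `split_steps` and `chain_stats` are all assumptions made for illustration.

```python
import re


def split_steps(trace: str) -> list[str]:
    """Split a reasoning trace into steps.

    Assumption: steps are separated by blank lines. The paper's actual
    segmentation rule is not given in this summary.
    """
    parts = re.split(r"\n\s*\n", trace.strip())
    return [p.strip() for p in parts if p.strip()]


def chain_stats(trace: str) -> dict:
    """Return step count and a simple word-based verbosity measure."""
    steps = split_steps(trace)
    words_per_step = [len(step.split()) for step in steps]
    total_words = sum(words_per_step)
    return {
        "step_count": len(steps),
        "total_words": total_words,
        # Average words per step; 0.0 for an empty trace.
        "words_per_step": total_words / len(steps) if steps else 0.0,
    }


# Hypothetical three-step trace, invented for this example.
example_trace = """Read the task: parse a CSV and return column means.

Plan: load rows with csv.reader, convert to float, average per column.

Edge case: empty file should return an empty list."""

stats = chain_stats(example_trace)
print(stats)
```

A real study would likely need model-specific segmentation, since different models format their reasoning traces differently.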
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Software Engineering Research
