Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation

Binquan Zhang; Li Zhang; Zhiwen Luo; Yuxin Du; Fang Liu; Song Wang; Lin Shi

arXiv:2507.06980 · cs.SE · July 10, 2025

Open Access

TL;DR

This paper investigates the quality of chain-of-thought reasoning in LLM-based code generation, analyzing factors affecting CoT quality, their impact on code correctness, and how refining CoTs can enhance performance.

Contribution

It provides an empirical analysis of CoT quality in LLM code generation, identifying external and internal factors influencing CoT effectiveness and demonstrating that refining CoTs can improve code accuracy.

Findings

01

External factors like unclear requirements significantly affect CoT quality

02

A notable percentage of correct code is paired with flawed CoTs

03

Refining CoTs with detailed descriptions improves code correctness

Abstract

Large language models (LLMs) have demonstrated impressive performance in code generation, particularly when augmented with chain-of-thought (CoT) prompting techniques. They break down requirements into intermediate reasoning steps, which act as design rationales to guide LLMs in writing code like human programmers. Thus, the quality of these steps is crucial for ensuring the correctness and reliability of the generated code. However, little is known about the quality of CoT generated by LLMs. To what extent can we trust the thoughts generated by LLMs? How good are they? This paper empirically explores the external and internal factors of why LLMs generate unsatisfactory CoTs by analyzing 1,023 failed code samples on two widely used code generation benchmarks. We also evaluate their impact on code generation performance by analyzing 210 CoT-code pairs and refining the unsatisfied CoTs by…

Taxonomy

Topics: Topic Modeling · Software Engineering Research · Artificial Intelligence in Healthcare and Education