Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation
Binquan Zhang, Li Zhang, Zhiwen Luo, Yuxin Du, Fang Liu, Song Wang, Lin Shi

TL;DR
This paper investigates the quality of chain-of-thought (CoT) reasoning in LLM-based code generation. It analyzes the factors that affect CoT quality, how CoT quality affects code correctness, and whether refining CoTs improves performance.
Contribution
It provides an empirical analysis of CoT quality in LLM code generation, identifying external and internal factors influencing CoT effectiveness and demonstrating that refining CoTs can improve code accuracy.
Findings
External factors like unclear requirements significantly affect CoT quality
A notable percentage of correct code is paired with flawed CoTs
Refining CoTs with detailed descriptions improves code correctness
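The second finding is a statement about labeled CoT-code pairs: among samples whose code is correct, some share has a flawed CoT. A minimal sketch of how such a share could be tallied follows. The function names and the toy labels are hypothetical illustrations, not the paper's tooling or data.

```python
from collections import Counter

def tally_pairs(pairs):
    """Count CoT-code pairs by their (cot_ok, code_ok) labels."""
    return Counter((bool(cot_ok), bool(code_ok)) for cot_ok, code_ok in pairs)

def flawed_cot_share_among_correct_code(pairs):
    """Fraction of correct-code samples whose CoT was labeled flawed."""
    counts = tally_pairs(pairs)
    correct_code = counts[(True, True)] + counts[(False, True)]
    if correct_code == 0:
        return 0.0  # no correct code, so the share is undefined; report 0.0
    return counts[(False, True)] / correct_code

# Hypothetical labels, for illustration only: (cot_ok, code_ok)
toy_pairs = [
    (True, True),
    (False, True),   # correct code paired with a flawed CoT
    (True, False),
    (False, False),
    (True, True),
]

share = flawed_cot_share_among_correct_code(toy_pairs)
```

Conditioning on correct code rather than on all samples matters here: the finding concerns cases where passing tests would otherwise hide a faulty rationale.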
Abstract
Large language models (LLMs) have demonstrated impressive performance in code generation, particularly when augmented with chain-of-thought (CoT) prompting techniques. They break down requirements into intermediate reasoning steps, which act as design rationales to guide LLMs in writing code like human programmers. Thus, the quality of these steps is crucial for ensuring the correctness and reliability of the generated code. However, little is known about the quality of CoT generated by LLMs. To what extent can we trust the thoughts generated by LLMs? How good are they? This paper empirically explores the external and internal factors of why LLMs generate unsatisfactory CoTs by analyzing 1,023 failed code samples on two widely used code generation benchmarks. We also evaluate their impact on code generation performance by analyzing 210 CoT-code pairs and refining the unsatisfied CoTs by…
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Artificial Intelligence in Healthcare and Education
