What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Kunhao Zheng; Juliette Decugis; Jonas Gehring; Taco Cohen; Benjamin Negrevergne; Gabriel Synnaeve

arXiv:2410.08105 · cs.CL · April 9, 2025

TL;DR

This paper systematically investigates prompting strategies, especially multi-turn re-prompting, to enhance large language models' reasoning in code generation, demonstrating improvements through both prompting and finetuning on competitive programming benchmarks.

Contribution

It provides a comprehensive analysis of prompting techniques for multi-turn code reasoning and shows how finetuning with these strategies improves model performance and scalability.

Findings

1. Certain prompting strategies consistently improve performance across models.
2. Finetuning with optimal prompts internalizes reasoning, boosting multi-turn code generation.
3. Strategies are effective for both small and large sampling budgets.

Abstract

Prompting techniques such as chain-of-thought have established themselves as a popular vehicle for improving the outputs of large language models (LLMs). For code generation, however, their exact mechanics and efficacy are under-explored. We thus investigate the effects of a wide range of prompting strategies with a focus on automatic re-prompting over multiple turns and computational requirements. After systematically decomposing reasoning, instruction, and execution feedback prompts, we conduct an extensive grid search on the competitive programming benchmarks CodeContests and TACO for multiple LLM families and sizes (Llama 3.0 and 3.1, 8B, 70B, 405B, and GPT-4o). Our study reveals strategies that consistently improve performance across all models with small and large sampling budgets. We then show how finetuning with such an optimal configuration allows models to internalize the…
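The automatic re-prompting over multiple turns described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_public_tests`, `multi_turn_generate`, and `toy_model` are hypothetical names, the "model" is a hard-coded stub rather than an LLM, and the feedback wording is invented for the example.

```python
from typing import Callable, List, Optional, Tuple

def run_public_tests(code: str, tests: List[Tuple[int, int]]) -> List[str]:
    """Run candidate code that should define solve(x); return failure messages."""
    namespace: dict = {}
    try:
        # Assumption: the code is trusted. Real systems sandbox this step.
        exec(code, namespace)
    except Exception as exc:
        return [f"error while loading code: {exc!r}"]
    solve = namespace.get("solve")
    if solve is None:
        return ["no function named solve was defined"]
    failures = []
    for arg, expected in tests:
        try:
            got = solve(arg)
        except Exception as exc:
            failures.append(f"solve({arg}) raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"solve({arg}) returned {got}, expected {expected}")
    return failures

def multi_turn_generate(
    generate: Callable[[List[str]], str],
    problem: str,
    tests: List[Tuple[int, int]],
    max_turns: int = 3,
) -> Tuple[Optional[str], int]:
    """Re-prompt with execution feedback until the public tests pass."""
    dialog = [problem]
    for turn in range(1, max_turns + 1):
        code = generate(dialog)
        failures = run_public_tests(code, tests)
        if not failures:
            return code, turn
        # Append the failed attempt and the execution feedback, then retry.
        dialog.append(code)
        dialog.append("Your code failed:\n" + "\n".join(failures) + "\nPlease fix it.")
    return None, max_turns

def toy_model(dialog: List[str]) -> str:
    """Hypothetical stand-in for an LLM: wrong first, correct after feedback."""
    if len(dialog) == 1:
        return "def solve(x):\n    return x + x\n"
    return "def solve(x):\n    return x * x\n"

final_code, turns = multi_turn_generate(
    toy_model, "Return the square of x.", [(2, 4), (3, 9)]
)
```

Here the stub's first attempt fails one public test, the failure message is appended to the dialog, and the second attempt passes. A real setup would replace `toy_model` with an LLM call and could vary the reasoning and instruction prompts per turn.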

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

**Originality**: This work provides a unique contribution by systematically exploring multi-turn code generation for large language models (LLMs) through diverse prompt configurations, including **chain-of-thought (CoT)**, **reasoning prompts**, and **execution feedback**. While prior work has explored CoT and single-turn code generation, this paper’s emphasis on **multi-turn prompting in competitive programming benchmarks** addresses an under-explored area. The study’s integration of multi-tu

Weaknesses

While the paper makes significant contributions to the understanding of multi-turn code generation in LLMs, there are a few areas where improvements could be made to strengthen the study further.

**1. Limited Scope of Multi-Turn Framework**

The paper’s framework, while well-designed for competitive programming benchmarks, focuses primarily on **chain-style multi-turn code generation**. This approach limits the exploration of more complex **tree-structured trajectories** or **backtracking mech

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The paper brings together the evaluation of a wide range of prompting strategies in both single-turn and multi-turn settings for multiple models on a challenging code dataset.
- Extensive set of experiments with an extensive grid search.
- Interesting idea of splitting the CoT prompts into reasoning, instruction, and feedback so that we can understand what strategies work best for different model sizes and problem difficulties.
- Multi-turn code generation findings reinforce insights from previous pape

Weaknesses

- **Limited novelty**: There is substantial existing research on CoT and self-repair in the code domain, which somewhat limits the paper's originality.
- CoT-Retry prompts seem like an arbitrary combination of prompts. Why this particular combination? The paper lacks any chain-of-thought prompting specific to reasoning about why the test failed or how to fix the test.
- For the additive reasoning prompts experiment in Table 6, what happens if each sample uses a different CoT prompt? Some of th

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. Comprehensive evaluation of CoT prompting strategies across different model sizes and benchmarks. I found the negative result that reasoning prompts are not additive interesting.
2. Usage of the pass n@k metric for a fairer comparison between single-turn and multi-turn approaches.
3. Thorough analysis of the impact of different error-feedback granularities.
4. Investigation of fine-tuning on multi-turn CoT data and its effects on model behavior, showing that it is effective for internalizing CoT steps in LLMs.
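To make the pass n@k idea mentioned above concrete, here is a simplified sketch. It assumes pass n@k means that a problem counts as solved when at least one of at most n submissions, chosen from k generated samples by preferring those that pass the public tests, passes all tests. The function names and the toy data are hypothetical, and the deterministic selection below is a simplification, not the estimator used in the paper.

```python
from typing import List, Tuple

# Each sample is (passes_public_tests, passes_all_tests).
Sample = Tuple[bool, bool]

def solved_with_n_of_k(samples: List[Sample], n: int) -> bool:
    """Submit at most n of the k samples, preferring public-test passers."""
    # sorted() is stable, so generation order is kept within each group;
    # samples that pass the public tests (key False) come first.
    ranked = sorted(samples, key=lambda s: not s[0])
    submitted = ranked[:n]
    return any(passes_all for _, passes_all in submitted)

def pass_n_at_k(per_problem: List[List[Sample]], n: int) -> float:
    """Fraction of problems solved under the n-submission budget."""
    solved = sum(solved_with_n_of_k(samples, n) for samples in per_problem)
    return solved / len(per_problem)

# Toy data (invented): two problems with k = 3 and k = 2 samples.
problems = [
    [(False, False), (True, True), (True, False)],
    [(True, False), (True, True)],
]
score_1 = pass_n_at_k(problems, 1)
score_2 = pass_n_at_k(problems, 2)
```

Because the submission budget n is fixed regardless of how many samples or turns produced the candidates, a metric of this shape lets single-turn and multi-turn runs be compared at equal sampling budgets.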

Weaknesses

1. Although this work has solid empirical evaluations and good presentation, its novelty is somewhat limited because it combines existing work, including CoT strategies, error-feedback granularities, and pass n@k. The core contribution is primarily a prompt engineering effort combined with the evaluation of existing LLM capabilities in code generation. While valuable, this doesn't represent a significant theoretical or methodological advancement in the field. As for the RFT part, although I thin

Videos

What Makes Large Language Models Reason in (Multi-Turn) Code Generation? · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Model-Driven Software Engineering Techniques · Topic Modeling
