ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments
Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He

TL;DR
ConvCodeWorld introduces a comprehensive benchmark environment for evaluating conversational code generation in interactive settings, emphasizing diverse feedback types and their impact on LLM performance.
Contribution
This work presents novel reproducible benchmarks, CONVCODEWORLD and CONVCODEBENCH, for assessing LLMs in multi-turn, feedback-rich code generation scenarios, addressing limitations of existing benchmarks.
Findings
LLM performance varies significantly with feedback type.
Weaker LLMs with feedback can outperform stronger models without feedback.
Training on specific feedback types can limit adaptability to new feedback.
Abstract
Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions, limiting our ability to evaluate LLMs in these contexts. To address this gap, we present a set of novel benchmarks that explicitly model the quality of feedback provided to code generation LLMs. Our contributions are threefold: First, we introduce CONVCODEWORLD, a novel and reproducible environment for benchmarking interactive code generation. CONVCODEWORLD simulates 9 distinct interactive code generation scenarios while systematically combining three types of feedback: (a) compilation feedback; (b) execution feedback with varying test coverage; (c) verbal feedback generated by GPT-4o with different levels of expertise. Second, we introduce CONVCODEBENCH, a…
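As a rough illustration of how three feedback types could combine into 9 scenarios, the sketch below enumerates one possible grid. It assumes compilation feedback is always present, execution feedback has three settings (absent, partial test coverage, full test coverage), and verbal feedback has three settings (absent, novice-level, expert-level). These level names and the always-on compilation assumption are hypothetical. The abstract as shown here does not spell out the exact combinations, so treat this as a sketch rather than the benchmark's definition.

```python
from itertools import product

# Hypothetical labels; the abstract only names the three feedback families.
EXECUTION_LEVELS = (None, "execution_partial_coverage", "execution_full_coverage")
VERBAL_LEVELS = (None, "verbal_novice", "verbal_expert")


def enumerate_scenarios():
    """Return every feedback combination under the assumptions stated above."""
    scenarios = []
    for execution, verbal in product(EXECUTION_LEVELS, VERBAL_LEVELS):
        combo = ["compilation"]  # assumed to be present in every scenario
        if execution is not None:
            combo.append(execution)
        if verbal is not None:
            combo.append(verbal)
        scenarios.append(tuple(combo))
    return scenarios


scenarios = enumerate_scenarios()
for index, combo in enumerate(scenarios, start=1):
    print(index, " + ".join(combo))
```

Under these assumptions the grid has 3 × 3 = 9 entries, ranging from compilation feedback alone to compilation plus full-coverage execution plus expert-level verbal feedback.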
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Artificial Intelligence in Healthcare and Education
Methods: Sparse Evolutionary Training
