ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments

Hojae Han; Seung-won Hwang; Rajhans Samdani; Yuxiong He

arXiv:2502.19852 · cs.SE · February 28, 2025

Open Access

TL;DR

ConvCodeWorld introduces a comprehensive benchmark environment for evaluating conversational code generation in interactive settings, emphasizing diverse feedback types and their impact on LLM performance.

Contribution

This work presents novel reproducible benchmarks, CONVCODEWORLD and CONVCODEBENCH, for assessing LLMs in multi-turn, feedback-rich code generation scenarios, addressing limitations of existing benchmarks.

Findings

01. LLM performance varies significantly with feedback type.

02. Weaker LLMs with feedback can outperform stronger models without feedback.

03. Training on specific feedback types can limit adaptability to new feedback.

Abstract

Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions, limiting our ability to evaluate LLMs in these contexts. To address this gap, we present a set of novel benchmarks that explicitly model the quality of feedback provided to code generation LLMs. Our contributions are threefold: First, we introduce CONVCODEWORLD, a novel and reproducible environment for benchmarking interactive code generation. CONVCODEWORLD simulates 9 distinct interactive code generation scenarios while systematically combining three types of feedback: (a) compilation feedback; (b) execution feedback with varying test coverage; (c) verbal feedback generated by GPT-4o with different levels of expertise. Second, we introduce CONVCODEBENCH, a…
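
To make the first two feedback types in the abstract concrete, here is a minimal Python sketch of how compilation feedback and execution feedback at a chosen test coverage might be collected for a candidate solution. It is an illustration under stated assumptions, not the CONVCODEWORLD implementation: the function names `compilation_feedback` and `execution_feedback`, the `coverage` parameter, and the test-case format are all hypothetical, and verbal feedback (generated by GPT-4o in the paper) is left out.

```python
from typing import List, Optional, Tuple


def compilation_feedback(source: str) -> Optional[str]:
    """Hypothetical helper: return a syntax error message, or None if the source compiles."""
    try:
        compile(source, "<candidate>", "exec")
        return None
    except SyntaxError as err:
        return f"SyntaxError: {err.msg} (line {err.lineno})"


def execution_feedback(
    source: str,
    func_name: str,
    tests: List[Tuple[tuple, object]],
    coverage: float,
) -> List[str]:
    """Hypothetical helper: run the first `coverage` fraction of tests and list failures.

    Assumes `source` already compiles and defines a function named `func_name`.
    """
    namespace: dict = {}
    exec(source, namespace)  # illustration only; never exec untrusted code unsandboxed
    func = namespace[func_name]
    visible = tests[: int(len(tests) * coverage)]
    failures = []
    for args, expected in visible:
        try:
            actual = func(*args)
        except Exception as err:
            failures.append(f"{func_name}{args} raised {type(err).__name__}: {err}")
            continue
        if actual != expected:
            failures.append(f"{func_name}{args} returned {actual!r}, expected {expected!r}")
    return failures


# A buggy candidate: subtracts instead of adding.
candidate = "def add(a, b):\n    return a - b\n"
tests = [((0, 0), 0), ((3, 0), 3), ((1, 2), 3), ((2, 2), 4)]

partial = execution_feedback(candidate, "add", tests, coverage=0.5)
full = execution_feedback(candidate, "add", tests, coverage=1.0)
```

In this toy case the two tests visible at half coverage happen to pass, so `partial` is empty, while `full` reports both failing cases. That is one way lower test coverage can yield less informative feedback for the next turn.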

Taxonomy

Topics: Topic Modeling · Software Engineering Research · Artificial Intelligence in Healthcare and Education

Methods: Sparse Evolutionary Training