When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback

Jane Pan; Ryan Shar; Jacob Pfau; Ameet Talwalkar; He He; Valerie Chen

arXiv:2502.18413 · cs.HC · February 26, 2025

TL;DR

This paper introduces an interactive evaluation method for code LLMs that assesses how models incorporate user feedback during collaboration, revealing performance variations and behavioral impacts not captured by static benchmarks.

Contribution

It presents a novel interactive evaluation pipeline that perturbs static benchmarks to simulate user interactions, providing insights into model behavior and performance in collaborative coding scenarios.

Findings

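01 Interaction changes model rankings: the relative order of 10 models across 3 datasets often differs between static and interactive settings.

02 Models are fairly robust to feedback that contains errors.

03 Feedback type influences how models respond and which edits they prioritize, even when different feedback types are equally effective for performance.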
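One way to make the first finding concrete is to compare each model's rank under the two settings. The snippet below is only an illustrative sketch: the model names and scores are made-up placeholders, not results from the paper, and the helpers `rank` and `rank_shifts` are hypothetical names introduced here.

```python
from typing import Dict


def rank(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its 1-based rank, where the highest score gets rank 1."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_shifts(static: Dict[str, float], interactive: Dict[str, float]) -> Dict[str, int]:
    """Return static rank minus interactive rank per model.

    A positive value means the model moved up in the interactive setting.
    """
    static_rank, interactive_rank = rank(static), rank(interactive)
    return {name: static_rank[name] - interactive_rank[name] for name in static}


# Placeholder scores for illustration only (NOT taken from the paper).
static_scores = {"model_a": 0.60, "model_b": 0.55, "model_c": 0.40}
interactive_scores = {"model_a": 0.70, "model_b": 0.80, "model_c": 0.50}

shifts = rank_shifts(static_scores, interactive_scores)
print(shifts)
```

With these placeholder numbers, `model_b` overtakes `model_a` once feedback is available, which is the kind of reordering the finding describes.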

Abstract

Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine how LLMs incorporate different types of feedback in a collaborative setting. Specifically, we perturb static coding benchmarks so that the code model must interact with a simulated user to retrieve key information about the problem. We find that interaction significantly affects model performance, as the relative rankings of 10 models across 3 datasets often vary between static and interactive settings, despite models being fairly robust to feedback that contains errors. We also observe that even when different feedback types are equally effective with respect to performance, they can impact model behaviors such as (1) how models respond to higher- vs.…
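The interactive setup described in the abstract can be sketched as a loop between a code model and a simulated user. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Problem` fields, the `interactive_eval` function, the fixed turn budget, and the toy model and user stubs are all hypothetical and stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Problem:
    full_spec: str       # original benchmark statement (seen by the simulated user)
    perturbed_spec: str  # statement with key information withheld (seen by the model)
    passes_tests: Callable[[str], bool]  # True if candidate code passes the tests


def interactive_eval(
    problem: Problem,
    code_model: Callable[[str, List[str]], str],
    simulated_user: Callable[[str, str], str],
    max_turns: int = 3,
) -> bool:
    """Run a fixed number of feedback turns, then score the final code.

    The fixed turn budget is an assumption made for this sketch.
    """
    feedback_history: List[str] = []
    code = code_model(problem.perturbed_spec, feedback_history)
    for _ in range(max_turns):
        # The simulated user sees the full problem and the current attempt.
        feedback_history.append(simulated_user(problem.full_spec, code))
        code = code_model(problem.perturbed_spec, feedback_history)
    return problem.passes_tests(code)


# Toy stand-ins for the LLM-backed components (hypothetical).
def toy_model(spec: str, feedback: List[str]) -> str:
    # Guesses wrong without feedback, fixes the code once feedback arrives.
    operator = "+" if feedback else "-"
    return f"def combine(a, b):\n    return a {operator} b\n"


def toy_user(full_spec: str, code: str) -> str:
    return "The function should return the sum of its arguments."


def toy_tests(code: str) -> bool:
    namespace: dict = {}
    exec(code, namespace)  # acceptable for a toy example; sandbox real model output
    return namespace["combine"](2, 3) == 5


problem = Problem(
    full_spec="Write combine(a, b) that returns a + b.",
    perturbed_spec="Write combine(a, b).",
    passes_tests=toy_tests,
)
static_result = interactive_eval(problem, toy_model, toy_user, max_turns=0)
interactive_result = interactive_eval(problem, toy_model, toy_user, max_turns=1)
print(static_result, interactive_result)
```

Setting `max_turns=0` gives the model no feedback on the perturbed problem, which serves as a no-interaction baseline within this sketch; any positive budget lets the simulated user supply the withheld information.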

Code & Models

Repositories

janepan9917/whenbenchmarkstalk (Official)

Taxonomy

Topics: Natural Language Processing Techniques