CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Myeongsoo Kim, Shweta Garg, Baishakhi Ray, Varun Kumar, and Anoop Deoras

arXiv:2507.10646 · cs.SE · January 16, 2026

TL;DR

CodeAssistBench (CAB) is a comprehensive, automated benchmark for evaluating multi-turn, project-specific programming assistance by large language models, highlighting significant performance gaps in realistic coding environments.

Contribution

Introduces CAB, the first large-scale, automated benchmark for multi-turn, project-grounded code assistance, enabling evaluation beyond traditional single-turn, isolated code tasks.

Findings

01 Models achieve 70-83% accuracy on Stack Overflow questions.

02 Models solve only 7.22-16.49% of issues in real-world repositories.

03 Current LLMs struggle with realistic, project-specific code assistance.

Abstract

Programming assistants powered by large language models have improved dramatically, yet existing benchmarks still evaluate them in narrow code-generation settings. Recent efforts such as InfiBench and StackEval rely on Stack Overflow questions and remain limited to single-turn interactions, manually curated data, and isolated snippets rather than full project environments. We introduce CodeAssistBench (CAB), the first benchmark for evaluating multi-turn, project-grounded programming assistance at scale. CAB automatically constructs datasets from GitHub issues tagged as questions, using an LLM-driven pipeline that filters noise, extracts runnable contexts, builds executable containers, and verifies environment correctness. This enables continuous, automated expansion across diverse repositories without manual intervention. Using CAB, we create a testbed of 3,286 real-world issues across…
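The four construction stages the abstract names (filter noise, extract runnable contexts, build executable containers, verify environment correctness) can be pictured as a chain of filters over candidate issues. The sketch below is a hypothetical illustration only, not CAB's implementation: the `Issue` fields, the stage functions, and the keyword checks are all assumptions made for the example, standing in for the LLM-driven steps and real container builds the abstract describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Issue:
    """Hypothetical record for a GitHub issue tagged as a question."""
    repo: str
    title: str
    body: str
    dockerfile: Optional[str] = None  # filled in by the build stage
    verified: bool = False            # set by the verification stage


def filter_noise(issue: Issue) -> Optional[Issue]:
    # Assumed stand-in for the LLM filter: drop issues with an empty body.
    return issue if issue.body.strip() else None


def extract_context(issue: Issue) -> Optional[Issue]:
    # Assumed stand-in: keep issues whose text mentions running something.
    return issue if "run" in issue.body.lower() else None


def build_container(issue: Issue) -> Optional[Issue]:
    # Assumed stand-in: record a placeholder recipe instead of building an image.
    issue.dockerfile = f"# placeholder build recipe for {issue.repo}"
    return issue


def verify_environment(issue: Issue) -> Optional[Issue]:
    # Assumed stand-in: treat the environment as verified if a recipe exists.
    issue.verified = issue.dockerfile is not None
    return issue if issue.verified else None


Stage = Callable[[Issue], Optional[Issue]]
STAGES: List[Stage] = [filter_noise, extract_context, build_container, verify_environment]


def run_pipeline(issues: List[Issue]) -> List[Issue]:
    """Pass each issue through every stage; drop it at the first stage that rejects it."""
    kept = []
    for issue in issues:
        current: Optional[Issue] = issue
        for stage in STAGES:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            kept.append(current)
    return kept
```

Because each stage either returns the issue or rejects it, new repositories can be fed through the same chain without manual curation, which is the property the abstract emphasizes.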

Videos

CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance · SlidesLive

Taxonomy

Topics: Text Readability and Simplification · Topic Modeling · Natural Language Processing Techniques