RubberDuckBench: A Benchmark for AI Coding Assistants

Ferida Mohammed; Fatma Ayad; Petros Maniatis; Satish Chandra; Elizabeth Dinella

arXiv:2601.16456·cs.SE·May 6, 2026


TL;DR

RubberDuckBench is a multilingual benchmark of real-world questions about code, drawn from GitHub pull request comments and designed to evaluate AI coding assistants. It shows that current models fall short on accuracy and consistency, and that they hallucinate frequently.

Contribution

This work introduces RubberDuckBench, a new benchmark with detailed rubrics for evaluating AI coding assistants on real-world, multilingual code-related questions.

Findings

01

State-of-the-art models perform poorly on the benchmark.

02

Most models answer only a few questions completely correctly.

03

Models frequently hallucinate, making false statements in 58.3% of responses.

Abstract

Programmers are turning to AI coding assistants to answer questions about their code. Benchmarks are needed to soundly evaluate these systems and understand their performance. To enable such a study, we curate a benchmark of real-world contextualized questions derived from GitHub pull request comments. Out of this work, we present RubberDuckBench: a multilingual benchmark of questions about code, along with detailed rubrics for evaluating answers. We evaluate a diverse set of 20 LLMs (proprietary and open-source) on answering these questions. We find that even state-of-the-art models fail to give consistent, correct responses across the benchmark. Grok 4 (69.29%), Claude Opus 4 (68.5%), and GPT-5 (67.8%) perform best overall, but do not exhibit pairwise significant superiority over the next 9 best-performing models. Most models obtain points through partial credit, with the best…
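To make the idea of rubric-based partial credit concrete, here is a minimal Python sketch. It is not the paper's scoring code: the `Criterion` structure, the point weights, and the function names are all hypothetical, and it assumes a response's score is simply the share of rubric points it earns, averaged across questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line item (hypothetical structure, not the paper's format)."""
    description: str
    points: float


def score_response(rubric, satisfied):
    """Return the percentage of rubric points earned by one response.

    `satisfied` is the set of indices into `rubric` that a grader judged
    the response to meet. Partial credit falls out naturally: meeting some
    but not all criteria yields a score strictly between 0 and 100.
    """
    total = sum(c.points for c in rubric)
    if total <= 0:
        raise ValueError("rubric must carry a positive number of points")
    earned = sum(c.points for i, c in enumerate(rubric) if i in satisfied)
    return 100.0 * earned / total


def benchmark_score(per_question_scores):
    """Average the per-question percentages into one benchmark score."""
    if not per_question_scores:
        raise ValueError("need at least one question score")
    return sum(per_question_scores) / len(per_question_scores)


def fully_correct_count(per_question_scores):
    """Count the questions on which a response earned every rubric point."""
    return sum(1 for s in per_question_scores if s == 100.0)


# Illustrative rubric for an invented question; the weights are assumptions.
example_rubric = [
    Criterion("Identifies the function that raises the error", 2.0),
    Criterion("Explains why the error is raised", 1.0),
    Criterion("States whether the behavior changed in this pull request", 1.0),
]

partial = score_response(example_rubric, {0, 2})    # meets two of three criteria
full = score_response(example_rubric, {0, 1, 2})    # meets every criterion
overall = benchmark_score([partial, full, 25.0])
```

Under this scheme a model can reach a respectable average while answering few questions completely correctly, which is the pattern the abstract describes when it says most models obtain points through partial credit.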
