Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
Yoseph Berhanu Alebachew, Hunter Leary, Swanand Vaishampayan, Chris Brown

TL;DR
This paper introduces StackRepoQA, a repository-level question answering dataset built from open-source Java projects, and evaluates LLMs on it. The evaluation exposes limits in genuine reasoning and shows that structural information improves performance.
Contribution
It presents the first multi-project, repository-level QA dataset and a systematic evaluation of LLMs on real-world software projects, highlighting current challenges and future directions.
Findings
LLMs achieve moderate accuracy on repository-level QA.
Structural signals improve LLM performance.
High scores often stem from answer memorization rather than reasoning.
Abstract
Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our…
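To make the retrieval idea in the abstract concrete, here is a minimal sketch of file-level retrieval followed by a one-hop expansion over a dependency graph. It is not the authors' pipeline: the token-overlap scorer is a simple stand-in for a real retriever, and the file names, the `graph` mapping, and the helper functions are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve_files(question, files, k=1):
    """Rank files by token overlap with the question.

    Assumption: token overlap stands in for whatever retriever a real system uses.
    """
    q = Counter(tokenize(question))
    scores = {}
    for path, content in files.items():
        c = Counter(tokenize(content))
        scores[path] = sum(min(q[t], c[t]) for t in q)
    # Sort by descending score, breaking ties by path for determinism.
    ranked = sorted(files, key=lambda p: (-scores[p], p))
    return ranked[:k]

def expand_with_dependencies(seeds, graph):
    """Append files one hop away from the seeds in the dependency graph."""
    context = list(seeds)
    for path in seeds:
        for dep in graph.get(path, []):
            if dep not in context:
                context.append(dep)
    return context

# Hypothetical toy repository: file path -> source text.
files = {
    "OrderService.java": "class OrderService { PaymentClient client; "
                         "void placeOrder() { client.charge(); } }",
    "PaymentClient.java": "class PaymentClient { void charge() { } }",
    "Logger.java": "class Logger { void log(String message) { } }",
}
# Hypothetical structural dependencies: file -> files it depends on.
graph = {"OrderService.java": ["PaymentClient.java"]}

question = "Why does placeOrder in OrderService fail?"
seeds = retrieve_files(question, files, k=1)
context = expand_with_dependencies(seeds, graph)
print(context)
```

In this toy case the question only mentions `OrderService`, so text matching alone would miss `PaymentClient.java`; following the dependency edge pulls it into the context that would be passed to the model.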
