SWE-QA: A Dataset and Benchmark for Complex Code Understanding
Laïla Elkoussy (LRE, EPITA), Julien Perez (EPITA, LRE)

TL;DR
SWE-QA introduces a comprehensive dataset and benchmark for evaluating complex, multi-hop code understanding in Python, highlighting the challenges faced by current language models in real-world software reasoning.
Contribution
The paper presents a novel dataset and benchmark that focus on multi-hop reasoning in code comprehension, bridging the gap between simple tasks and real-world software development complexity.
Findings
Current language models struggle with multi-hop reasoning in code understanding.
Dense architectures outperform mixture-of-experts models by 10-14 percentage points.
Best model achieves 74.41% accuracy on the SWE-QA benchmark.
Abstract
In this paper, we introduce SWE-QA, a text and code corpus aimed at benchmarking multi-hop code comprehension, addressing the gap between simplified evaluation tasks and the complex reasoning required in real-world software development. While existing code understanding benchmarks focus on isolated snippets, developers must routinely connect information across multiple dispersed code segments. The dataset comprises 9,072 multiple-choice questions systematically generated from 12 Python repositories of SWE-bench, evaluating several recurrent reasoning patterns like Declaration-and-Call questions that link entity definitions to their usage, and Interacting-Entity questions that examine the dynamic relationships among multiple collaborating components. Generated through parsing-based entity extraction and Large Language Model assisted question construction with carefully validated…
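A Declaration-and-Call question needs, at minimum, a pairing between where an entity is defined and where it is used. The sketch below illustrates that idea with Python's standard `ast` module. It is not the authors' pipeline: the function name `declaration_call_pairs`, the `EXAMPLE` snippet, and the restriction to plain-name function calls within a single file are all assumptions made for illustration.

```python
import ast


def declaration_call_pairs(source: str):
    """Return (name, def_line, call_line) for each call to a function defined in `source`.

    Hypothetical helper: it only handles one file and plain-name calls.
    """
    tree = ast.parse(source)

    # Map each function name to the line where it is defined.
    definitions = {
        node.name: node.lineno
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

    pairs = []
    for node in ast.walk(tree):
        # Keep calls such as `f(x)`; attribute calls like `obj.f(x)` are skipped.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            name = node.func.id
            if name in definitions:
                pairs.append((name, definitions[name], node.lineno))

    # Sort by call line so the output order is deterministic.
    return sorted(pairs, key=lambda p: (p[2], p[0]))


# Illustrative input, not taken from the dataset.
EXAMPLE = '''def load(path):
    return open(path).read()

def main():
    text = load("a.txt")
    return len(text)
'''

pairs = declaration_call_pairs(EXAMPLE)
print(pairs)
```

Each resulting tuple names a function together with its definition line and one call line, which is the raw material a question generator could turn into a "where is this defined, and how is it used here?" item. Linking entities across files, methods, and classes, as a repository-scale benchmark requires, would need import and scope resolution that this sketch leaves out.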
