SWE-QA: A Dataset and Benchmark for Complex Code Understanding

Laïla Elkoussy (LRE; EPITA); Julien Perez (EPITA; LRE)

arXiv:2604.24814 · cs.SE · April 29, 2026

TL;DR

SWE-QA introduces a comprehensive dataset and benchmark for evaluating complex, multi-hop code understanding in Python, highlighting the challenges faced by current language models in real-world software reasoning.

Contribution

The paper presents a novel dataset and benchmark that focus on multi-hop reasoning in code comprehension, bridging the gap between simple tasks and real-world software development complexity.

Findings

01

Current language models struggle with multi-hop reasoning in code understanding.

02

Dense architectures outperform mixture-of-experts models by 10-14 percentage points.

03

The best-performing model achieves 74.41% accuracy on the SWE-QA benchmark.

Abstract

In this paper, we introduce SWE-QA, a text and code corpus aimed at benchmarking multi-hop code comprehension, addressing the gap between simplified evaluation tasks and the complex reasoning required in real-world software development. While existing code understanding benchmarks focus on isolated snippets, developers must routinely connect information across multiple dispersed code segments. The dataset comprises 9,072 multiple-choice questions systematically generated from 12 Python repositories of SWE-bench, evaluating several recurrent reasoning patterns like Declaration-and-Call questions that link entity definitions to their usage, and Interacting-Entity questions that examine the dynamic relationships among multiple collaborating components. Generated through parsing-based entity extraction and Large Language Model assisted question construction with carefully validated…
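To make the Declaration-and-Call idea concrete, here is a minimal sketch of how parsing-based extraction could link a function definition to its call sites, using Python's standard `ast` module. It is an illustration only and not the authors' pipeline: the helper name `declaration_and_call_pairs`, the toy `SOURCE` string, and the restriction to plain-name function calls are all assumptions made for this example.

```python
import ast

# Toy module used as input; hypothetical, not taken from the SWE-QA repositories.
SOURCE = '''
def load(path):
    return open(path).read()

def main():
    text = load("data.txt")
    return len(text)
'''


def declaration_and_call_pairs(source):
    """Return sorted (function name, definition line, call line) triples."""
    tree = ast.parse(source)

    # Map each function defined in the module to the line of its definition.
    definitions = {
        node.name: node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }

    pairs = []
    for node in ast.walk(tree):
        # Only plain-name calls such as load(...) are matched here;
        # method calls like obj.read() are skipped in this sketch.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            name = node.func.id
            if name in definitions:
                pairs.append((name, definitions[name], node.lineno))
    return sorted(pairs)


pairs = declaration_and_call_pairs(SOURCE)
print(pairs)
```

Each triple pairs a declaration with one usage, which is the kind of link a Declaration-and-Call question would ask a model to recover. A real extractor would also need to resolve imports, methods, and cross-file references, which this sketch deliberately leaves out.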
