GREPO: A Benchmark for Graph Neural Networks on Repository-Level Bug Localization
Juntong Wang, Libin Chen, Xiyuan Wang, Shijia Kang, Haotong Yang, Da Zheng, Muhan Zhang

TL;DR
GREPO introduces a new benchmark dataset for applying Graph Neural Networks to repository-level bug localization, demonstrating their effectiveness over traditional retrieval methods.
Contribution
This paper presents GREPO, the first dedicated GNN benchmark for large-scale bug localization, including a dataset of 86 repositories and 47,294 tasks, enabling future research in this area.
Findings
GNNs outperform traditional retrieval baselines in bug localization.
GREPO provides a scalable, repository-level dataset for GNN research.
GNN architectures show promising results on real-world software repositories.
Abstract
Repository-level bug localization, the task of identifying where code must be modified to fix a bug, is a critical software engineering challenge. Standard Large Language Models (LLMs) are often unsuitable for this task due to context window limitations that prevent them from processing entire code repositories. As a result, various retrieval methods are commonly used, including keyword matching, text similarity, and simple graph-based heuristics such as Breadth-First Search. Graph Neural Networks (GNNs) offer a promising alternative due to their ability to model complex, repository-wide dependencies; however, their application has been hindered by the lack of a dedicated benchmark. To address this gap, we introduce GREPO, the first GNN benchmark for repository-scale bug localization tasks. GREPO comprises 86 Python repositories and 47,294 bug-fixing tasks, providing graph-based data…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. The approach described in the paper to leverage GNNs for the bug localization task is very detailed and carefully designed. 2. The ablation experiments are pretty comprehensive and explore a variety of methodological choices. 3. The results, shown on a subset of the GREPO benchmark, look very impressive.
While the experimental results presented by the authors do look very promising, below are reasons why I am not yet convinced of the authors’ claim that GNNs will be able to outperform information retrieval (IR) approaches: 1. **Inadequate Baselines**: While the authors claim to compare against IR approaches, they haven’t considered any embedding-specific or retrieve-and-rerank style methods. I would suggest comparing against the SweRank approach [1], which uses a code embedding model for fun
1. The research motivation is clear, and the overall idea is meaningful. 2. The ablation experiments validate the effectiveness of the methodological design. 3. GREPO provides a valuable dataset resource for graph-based software repository analysis.
1. Unclear methodological exposition: The introduction clearly states the research motivation, but the subsequent sections fail to describe the motivation and construction process of the benchmark in a coherent way. As a result, the paper reads as fragmented and lacks a continuous narrative. 2. Definitions and formulas lack rigor: The concepts of anchor nodes, similarity features, and subgraph extraction are insufficiently described. In Figure 1 (page 4), anchor-related content is not mentioned at
1. The creation of a large, pre-processed, and graph-ready benchmark (GREPO) is a substantial engineering effort. Such a resource is valuable and can lower the barrier to entry for researchers wanting to apply GNNs to this domain. 2. The pipeline for constructing the dataset, including the use of a temporal graph to efficiently handle different commit snapshots and the careful collection of high-quality labels from pull requests and issues, is well-designed. 3. The authors have made the code o
1. The primary contribution of this work is the construction of a new dataset. Given that the main novelty lies in the dataset and its empirical findings, the paper might be a better fit for a conference with a dedicated "Datasets and Benchmarks" track or a top-tier software engineering venue (e.g., ICSE, FSE, ASE), where the contribution would be more prominently highlighted. For a venue like ICLR, the innovation seems limited. 2. The performance of the AgentLess baseline is perplexing an
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Topic Modeling
