SaraCoder: Orchestrating Semantic and Structural Cues for Resource-Optimized Repository-Level Code Completion

Xiaohan Chen; Zhongying Pan; Quan Feng; Yu Tian; Shuqun Yang; Mengru Wang; Lina Gong; Yuxia Geng; Piji Li; Xiang Chen

arXiv:2508.10068·cs.SE·October 14, 2025

TL;DR

SaraCoder introduces a resource-efficient retrieval augmentation method that enhances repository-level code completion by maximizing information diversity and relevance through hierarchical feature optimization and structural analysis.

Contribution

It presents a novel resource-optimized retrieval augmentation framework with modules for semantic refinement, structural similarity assessment, and symbol disambiguation, improving code completion accuracy.

Findings

01. Outperforms existing baselines on the CrossCodeEval and RepoEval-Updated benchmarks.

02. Effectively handles cross-file symbol ambiguity.

03. Enhances code completion across multiple programming languages.

Abstract

Despite Retrieval-Augmented Generation improving code completion, traditional retrieval methods struggle with information redundancy and a lack of diversity within limited context windows. To solve this, we propose a resource-optimized retrieval augmentation method, SaraCoder. It maximizes information diversity and representativeness in a limited context window, significantly boosting the accuracy and reliability of repository-level code completion. Its core Hierarchical Feature Optimization module systematically refines candidates by distilling deep semantic relationships, pruning exact duplicates, assessing structural similarity with a novel graph-based metric that weighs edits by their topological importance, and reranking results to maximize both relevance and diversity. Furthermore, an External-Aware Identifier Disambiguator module accurately resolves cross-file symbol ambiguity…
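
The duplicate-pruning and reranking steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace normalization, the token-level Jaccard similarity, the greedy MMR-style selection rule, and every function name and parameter below (`prune_exact_duplicates`, `rerank_mmr`, `lam`) are assumptions chosen for illustration. The paper's graph-based structural metric and its semantic model are not reproduced here. Only the use of MD5 hashing for exact duplicates is taken from the reviews.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Collapse whitespace so formatting-only differences hash identically (assumed rule)."""
    return " ".join(code.split())


def prune_exact_duplicates(candidates: list[str]) -> list[str]:
    """Keep the first occurrence of each snippet, keyed by an MD5 digest."""
    seen: set[str] = set()
    unique: list[str] = []
    for code in candidates:
        digest = hashlib.md5(normalize(code).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(code)
    return unique


def tokens(code: str) -> set[str]:
    """Identifier-like tokens; a crude stand-in for a learned code representation."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank_mmr(query: str, candidates: list[str], k: int, lam: float = 0.7) -> list[str]:
    """Greedily pick k snippets, trading relevance to the query against
    similarity to snippets already picked (MMR-style, hypothetical weights)."""
    query_tokens = tokens(query)
    pool = [(code, tokens(code)) for code in candidates]
    selected: list[tuple[str, set[str]]] = []

    def score(item: tuple[str, set[str]]) -> float:
        _, toks = item
        relevance = jaccard(query_tokens, toks)
        redundancy = max((jaccard(toks, s) for _, s in selected), default=0.0)
        return lam * relevance - (1.0 - lam) * redundancy

    while pool and len(selected) < k:
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
    return [code for code, _ in selected]


if __name__ == "__main__":
    retrieved = [
        "def add(a, b):\n    return a + b",
        "def add(a, b):  return a + b",  # same code, different whitespace
        "def add(a, b, c): return a + b + c",
        "def mul(x, y): return x * y",
    ]
    unique = prune_exact_duplicates(retrieved)
    for snippet in rerank_mmr("def add(x, y): return x + y", unique, k=2):
        print(repr(snippet))
```

Lowering `lam` penalizes near-duplicates more heavily, so more distinct snippets fit into a fixed context budget. Raising it toward 1.0 reduces the procedure to plain relevance ranking.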

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 4

Strengths

- Comprehensive system design: The paper systematically covers multiple aspects of retrieval-based code completion: semantic similarity, structure, redundancy, and external symbol handling.
- Practical focus: Addresses real-world concerns such as resource constraints and redundant retrievals in repo-level code completion.

Weaknesses

- Limited novelty: Each module (semantic filtering, code graph construction, redundancy reduction, reranking) has been well studied in prior work. The paper seems to combine known components without a new conceptual insight or methodological breakthrough.
- Weak significance: The improvements are incremental and largely driven by EAID, which itself is a straightforward dependency lookup. The other “hierarchical optimization” modules add only a marginal effect.
- Benchmark ambiguity: “RepoEva

Reviewer 02·Rating 2·Confidence 2

Strengths

1. The paper is well written and easy to follow.
2. The improvement of EM and ES on two datasets demonstrates the effectiveness of SaraCoder.

Weaknesses

1. The motivation of this paper is doubtful. According to the authors, one problem in current RAG systems is that the retrieved code snippets are highly similar to each other. To solve this problem, the authors propose HF_OP, which includes redundancy elimination and diversity-aware reranking, and they give an illustrative example in Figure 1(b). However, I doubt the frequency of this phenomenon. Usually, if a project is well designed, developers seldom reinvent the wheel. The authors sho

Reviewer 03·Rating 2·Confidence 4

Strengths

The paper focuses on inefficient context utilization in retrieval-augmented generation, which is a timely and practical issue in repository-level code completion. The motivation is sound: reducing redundancy and increasing information diversity within constrained context windows can improve both model performance and system efficiency.

Weaknesses

1. Clarity and Organization: Key concepts such as program slicing, query graph, and candidate graph are introduced without sufficient detail in the main body, and there is no clear cross-reference to their definitions in the appendix. Conversely, details like the use of GraphCodeBERT or MD5 hashing are elaborated unnecessarily in the main body, distracting from the high-level contributions.
2. System Complexity and Focus: SaraCoder comprises numerous components, making it more suitable for soft

Reviewer 04·Rating 4·Confidence 4

Strengths

+ The proposed hierarchical optimization pipeline is well-motivated and systematically integrates multiple complementary criteria (semantic alignment, redundancy reduction, structural similarity, and diversity).
+ The paper is clearly written, and the methodology is easy to follow.

Weaknesses

My main concerns are as follows:

1. **Limited conceptual novelty.** The paper is largely incremental. It stacks multiple retrieval and context-filtering modules (semantic filtering, reranking, enhancement, etc.) on top of an existing RAG framework. While the overall pipeline is systematic, similar ideas have already been explored in prior works. The paper lacks clear theoretical insight or conceptual novelty beyond combining known heuristics.
2. **Limited empirical significance and questi
