TL;DR
SaraCoder introduces a resource-efficient retrieval augmentation method that enhances repository-level code completion by maximizing information diversity and relevance through hierarchical feature optimization and structural analysis.
Contribution
It presents a novel resource-optimized retrieval augmentation framework with modules for semantic refinement, structural similarity assessment, and symbol disambiguation, improving code completion accuracy.
Findings
Outperforms existing baselines on CrossCodeEval and RepoEval-Updated benchmarks.
Effectively handles cross-file symbol ambiguity.
Enhances code completion across multiple programming languages.
Abstract
Despite Retrieval-Augmented Generation improving code completion, traditional retrieval methods struggle with information redundancy and a lack of diversity within limited context windows. To solve this, we propose a resource-optimized retrieval augmentation method, SaraCoder. It maximizes information diversity and representativeness in a limited context window, significantly boosting the accuracy and reliability of repository-level code completion. Its core Hierarchical Feature Optimization module systematically refines candidates by distilling deep semantic relationships, pruning exact duplicates, assessing structural similarity with a novel graph-based metric that weighs edits by their topological importance, and reranking results to maximize both relevance and diversity. Furthermore, an External-Aware Identifier Disambiguator module accurately resolves cross-file symbol ambiguity…
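Two of the steps named in the abstract, pruning exact duplicates and reranking for both relevance and diversity, can be sketched as follows. This is a minimal illustration under stated assumptions, not SaraCoder's implementation. The hash-based deduplication follows the MD5 hashing a reviewer mentions. The reranker uses Maximal Marginal Relevance with whitespace-token Jaccard similarity as a stand-in, because this page does not specify the paper's similarity measures or scoring. The function names `dedupe_exact`, `jaccard`, and `mmr_rerank` and the weight `lam` are hypothetical.

```python
import hashlib
from typing import List, Sequence


def dedupe_exact(snippets: Sequence[str]) -> List[str]:
    """Drop byte-identical snippets, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for snippet in snippets:
        digest = hashlib.md5(snippet.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(snippet)
    return unique


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; an assumed stand-in for a learned similarity."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mmr_rerank(query: str, candidates: Sequence[str], k: int,
               lam: float = 0.7) -> List[str]:
    """Greedy Maximal Marginal Relevance selection of up to k candidates.

    Each step picks the candidate with the best trade-off between
    relevance to the query and redundancy with what is already selected.
    lam=1.0 ranks by relevance alone (assumed default of 0.7).
    """
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(candidate: str) -> float:
            relevance = jaccard(query, candidate)
            redundancy = max(
                (jaccard(candidate, chosen) for chosen in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=0.7`, a candidate whose token set matches an already selected snippet is passed over in favor of a less relevant but different one; with `lam=1.0` the same candidate is selected, which is the redundancy the abstract describes.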
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Comprehensive system design: The paper systematically covers multiple aspects of retrieval-based code completion: semantic similarity, structure, redundancy, and external symbol handling.
- Practical focus: Addresses real-world concerns such as resource constraints and redundant retrievals in repo-level code completion.
- Limited novelty: Each module (semantic filtering, code graph construction, redundancy reduction, reranking) has been well studied in prior work. The paper seems to combine known components without a new conceptual insight or methodological breakthrough.
- Weak significance: The improvements are incremental and largely driven by EAID, which itself is a straightforward dependency lookup. The other “hierarchical optimization” modules add only a marginal effect.
- Benchmark ambiguity: “RepoEva
1. The paper is well written and easy to follow.
2. The improvement of EM and ES on two datasets demonstrates the effectiveness of SaraCoder.
1. The motivation of this paper is doubtful. According to the authors, one of the problems in current RAG systems is that the retrieved codes are highly similar to each other. To solve this problem, the authors propose HF_OP, which includes redundancy elimination and diversity-aware reranking. And the authors give an illustrative example in Figure 1(b). However, I doubt the frequency of this phenomenon. Usually, if a project is well-designed, developers seldom reinvent the wheel. The authors sho
The paper focuses on inefficient context utilization in retrieval-augmented generation, which is a timely and practical issue in repository-level code completion. The motivation is sound: reducing redundancy and increasing information diversity within constrained context windows can improve both model performance and system efficiency.
1. Clarity and Organization: Key concepts such as program slicing, query graph, and candidate graph are introduced without sufficient detail in the main body, and there is no clear cross-reference to their definitions in the appendix. Conversely, details like the use of GraphCodeBERT or MD5 hashing are elaborated unnecessarily in the main body, distracting from the high-level contributions.
2. System Complexity and Focus: SaraCoder comprises numerous components, making it more suitable for soft
+ The proposed hierarchical optimization pipeline is well-motivated and systematically integrates multiple complementary criteria (semantic alignment, redundancy reduction, structural similarity, and diversity).
+ The paper is clearly written, and the methodology is easy to follow.
My main concerns are as follows:
1. **Limited conceptual novelty.** The paper is largely incremental. It stacks multiple retrieval and context-filtering modules (semantic filtering, reranking, enhancement, etc.) on top of an existing RAG framework. While the overall pipeline is systematic, similar ideas have already been explored in prior works. The paper lacks clear theoretical insight or conceptual novelty beyond combining known heuristics.
2. **Limited empirical significance and questi
