Do Not Treat Code as Natural Language: Implications for Repository-Level Code Generation and Beyond
Minh Le-Anh, Huyen Nguyen, Khanh An Tran, Nam Le Hai, Linh Ngo Van, Nghi D.Q. Bui, Bach Le

TL;DR
This paper introduces Hydra, a structure-aware, dependency-focused retrieval framework for repository-level code generation that significantly improves performance over existing methods by preserving code structure and dependencies.
Contribution
Hydra presents a novel hierarchy-based indexing and dependency-aware retrieval approach tailored for repository-level code generation, addressing limitations of traditional NLP-inspired methods.
Findings
Hydra achieves over 5% improvement in Pass@1 on DevEval and RepoExec benchmarks.
Hydra enables smaller models to match or outperform larger models with existing retrieval methods.
The approach significantly enhances code generation quality by preserving structural dependencies.
Abstract
Large language models for code (CodeLLMs) have demonstrated remarkable success in standalone code completion and generation, sometimes even surpassing human performance, yet their effectiveness diminishes in repository-level settings where cross-file dependencies and structural context are essential. Existing Retrieval-Augmented Generation (RAG) approaches often borrow strategies from NLP, relying on chunking-based indexing and similarity-based retrieval. Chunking results in the loss of coherence between code units and overlooks structural relationships, while similarity-driven methods frequently miss functionally relevant dependencies such as helper functions, classes, or global variables. To address these limitations, we present Hydra, a repository-level code generation framework that treats code as structured code rather than natural language. Our approach introduces (i) a…
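The contrast the abstract draws, between chunk-and-similarity retrieval and retrieval that keeps code units whole and follows their dependencies, can be illustrated with a small sketch. This is not Hydra's implementation: the names `index_units`, `direct_deps`, and `retrieve_context`, and the toy `SOURCE` module, are hypothetical and exist only for this example. The sketch assumes a single Python file and uses the standard `ast` module to index top-level functions, classes, and global assignments as complete units, then collects the units a target function refers to.

```python
import ast

# Toy module standing in for a repository file (hypothetical example data).
SOURCE = '''
SCALE = 3

def helper(x):
    return x * SCALE

class Box:
    def __init__(self, v):
        self.v = v

def target(n):
    b = Box(helper(n))
    return b.v
'''


def index_units(source):
    """Index top-level functions, classes, and globals as whole AST nodes."""
    units = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            units[node.name] = node
        elif isinstance(node, ast.Assign):
            for tgt in node.targets:
                if isinstance(tgt, ast.Name):
                    units[tgt.id] = node
    return units


def direct_deps(node, units):
    """Names referenced inside `node` that are themselves indexed units."""
    names = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
    return sorted(names & set(units))


def retrieve_context(name, units):
    """Collect every unit reachable from `name` by following references."""
    seen, stack = [], [name]
    while stack:
        current = stack.pop()
        for dep in direct_deps(units[current], units):
            if dep != name and dep not in seen:
                seen.append(dep)
                stack.append(dep)
    return sorted(seen)


UNITS = index_units(SOURCE)
CONTEXT = retrieve_context("target", UNITS)

# Each retrieved unit is returned as complete source text, not a fixed-size chunk.
SNIPPETS = {n: ast.get_source_segment(SOURCE, UNITS[n]) for n in CONTEXT}
```

Here `retrieve_context("target", UNITS)` picks up `helper`, `Box`, and the global `SCALE`, even though `SCALE` never appears in the body of `target` and a purely similarity-based lookup on that body could miss it. A real repository-level system would also need cross-file import resolution, which this sketch leaves out.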
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Scientific Computing and Data Management
