AOCI: Symbolic-Semantic Indexing for Practical Repository-Scale Code Understanding with LLMs
Jinshi Liu, Hanying Zuo, Congyin Cao, Anran Zhang, Yixuan Liu, Xinzhou Xie

TL;DR
AOCI introduces a symbolic-semantic indexing method that provides a stable, comprehensive repository-level blueprint for large codebases, enhancing LLM understanding and maintenance efficiency.
Contribution
This paper presents AOCI, a systematic symbolic-semantic repository representation that improves large-scale code understanding with LLMs, outperforming existing methods in accuracy and efficiency.
Findings
AOCI outperforms all baseline methods in accuracy across evaluations.
AOCI produces zero defects on industrial tasks, unlike some agent-based tools.
AOCI significantly reduces token consumption on complex tasks.
Abstract
Large language models struggle to understand codebases beyond a certain scale, that is, repositories with hundreds of thousands of lines of code. Existing methods (retrieval, summarization, and agent exploration) each construct a different view at query time. The view varies between runs, and what persists is typically ad hoc rather than systematic. This paper introduces AOCI (AI-Oriented Code Indexing), a symbolic-semantic repository representation: a structured blueprint that an LLM can read in a single pass to gain a complete repository-level picture of the system's architecture, dependencies, and key design decisions before any task. An AOCI index consists of encoding rules followed by entries, with one entry per code unit (file or database table). Each entry pairs a symbolic tag with semantic content. The symbolic component provides architectural coordinates; the semantic component…
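The index layout described in the abstract (encoding rules first, then one entry per code unit, each pairing a symbolic tag with semantic content) can be pictured with a small data-structure sketch. This is an illustration only, not the authors' format: the class names `IndexEntry` and `RepositoryIndex`, the `LAYER.ROLE` tag scheme, the rendered text layout, and the example paths are all assumptions made here for clarity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class IndexEntry:
    """One entry per code unit (hypothetical shape, not the paper's schema)."""
    unit: str       # code unit: a file path or a database table name
    kind: str       # "file" or "table"
    tag: str        # symbolic tag; the LAYER.ROLE scheme is invented here
    semantics: str  # short natural-language description of the unit


@dataclass
class RepositoryIndex:
    """Encoding rules followed by entries, readable in a single pass."""
    encoding_rules: List[str]
    entries: List[IndexEntry] = field(default_factory=list)

    def add(self, entry: IndexEntry) -> None:
        # Enforce the two unit kinds named in the abstract.
        if entry.kind not in ("file", "table"):
            raise ValueError(f"unknown unit kind: {entry.kind!r}")
        # Enforce one entry per code unit.
        if any(e.unit == entry.unit for e in self.entries):
            raise ValueError(f"duplicate entry for unit: {entry.unit!r}")
        self.entries.append(entry)

    def render(self) -> str:
        # Sort by unit name so the rendered text does not depend on
        # insertion order and stays identical between runs.
        lines = ["# Encoding rules"]
        lines += [f"- {rule}" for rule in self.encoding_rules]
        lines.append("# Entries")
        for e in sorted(self.entries, key=lambda e: e.unit):
            lines.append(f"[{e.tag}] {e.unit} ({e.kind}): {e.semantics}")
        return "\n".join(lines)


def build_example_index() -> RepositoryIndex:
    """Build a tiny index over two made-up code units."""
    index = RepositoryIndex(
        encoding_rules=["tag = LAYER.ROLE (hypothetical scheme)"]
    )
    index.add(IndexEntry("src/api/orders.py", "file", "API.HANDLER",
                         "HTTP handlers for order creation."))
    index.add(IndexEntry("orders", "table", "DB.TABLE",
                         "One row per customer order."))
    return index


if __name__ == "__main__":
    print(build_example_index().render())
```

Under these assumptions, the rendered text would be placed in the LLM's context before a task. Sorting the entries is one simple way to keep the view the same across runs, in contrast to views constructed at query time.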
