Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP
Martin Vogel, Falk Meyer-Eschenbach, Severin Kohler, Elias Grünewald, Felix Balzer

TL;DR
Codebase-Memory is an open-source system that builds a persistent knowledge graph from codebases in 66 languages using Tree-Sitter, sharply reducing the tokens and tool calls LLM agents need for code exploration while staying close to a file-exploration agent on answer quality.
Contribution
It introduces a novel Tree-Sitter-based knowledge graph construction method via MCP, enabling efficient, multi-language code understanding for LLM agents.
Findings
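Achieves 83% answer quality (versus 92% for a file-exploration agent) at ten times fewer tokens and 2.1 times fewer tool calls, across 31 real-world repositories.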
Matches or exceeds the file-exploration agent on graph-native queries such as hub detection and caller ranking in 19 of 31 languages.
Supports 66 languages through a multi-phase, parallel pipeline.
Abstract
Large Language Model (LLM) coding agents typically explore codebases through repeated file-reading and grep-searching, consuming thousands of tokens per query without structural understanding. We present Codebase-Memory, an open-source system that constructs a persistent, Tree-Sitter-based knowledge graph via the Model Context Protocol (MCP), parsing 66 languages through a multi-phase pipeline with parallel worker pools, call-graph traversal, impact analysis, and community discovery. Evaluated across 31 real-world repositories, Codebase-Memory achieves 83% answer quality versus 92% for a file-exploration agent, at ten times fewer tokens and 2.1 times fewer tool calls. For graph-native queries such as hub detection and caller ranking, it matches or exceeds the explorer on 19 of 31 languages.
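To make the graph-native queries mentioned in the abstract concrete, the following is a minimal sketch of hub detection, caller ranking, and impact analysis over a toy call graph. It is not the system's implementation: the edge list `CALLS` and the helper names `build_reverse_index`, `rank_hubs`, and `impacted_by` are hypothetical, and the sketch assumes call edges have already been extracted from parsed syntax trees.

```python
from collections import defaultdict, deque

# Hypothetical call edges (caller, callee). A real system would extract
# these from parsed syntax trees; here they are hard-coded for illustration.
CALLS = [
    ("main", "load_config"),
    ("main", "run"),
    ("run", "parse"),
    ("run", "log"),
    ("parse", "log"),
    ("load_config", "log"),
]


def build_reverse_index(edges):
    """Map each callee to the set of functions that call it directly."""
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)
    return callers


def rank_hubs(edges):
    """Rank functions by number of distinct direct callers, highest first."""
    callers = build_reverse_index(edges)
    ranked = ((fn, len(cs)) for fn, cs in callers.items())
    # Sort by caller count descending, then by name for a stable order.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


def impacted_by(edges, target):
    """Return all transitive callers of `target` via breadth-first search."""
    callers = build_reverse_index(edges)
    seen, queue = set(), deque([target])
    while queue:
        fn = queue.popleft()
        for caller in callers.get(fn, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


if __name__ == "__main__":
    print(rank_hubs(CALLS))
    print(sorted(impacted_by(CALLS, "parse")))
```

Answering such questions from a precomputed reverse index takes one lookup or one traversal, whereas a file-exploration agent would have to grep for each call site and read the surrounding files, which is where the token savings reported above would be expected to come from.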
