Code-Survey: An LLM-Driven Methodology for Analyzing Large-Scale Codebases
Yusheng Zheng, Yiwei Yang, Haoqin Tu, Yuxi Huang

TL;DR
Code-Survey is an LLM-driven methodology that systematically analyzes large-scale codebases by transforming unstructured development artifacts, such as commits and emails, into structured datasets, enabling quantitative analysis of software evolution, design, and security.
Contribution
It introduces a novel LLM-based survey approach that treats development artifacts as social data, allowing structured analysis of complex, evolving software systems like the Linux kernel.
Findings
Uncovered development patterns and feature interdependencies in Linux eBPF
Validated insights with domain experts
Demonstrated applicability to other large-scale projects
Abstract
Modern software systems like the Linux kernel are among the world's largest and most intricate codebases, continually evolving with new features and increasing complexity. Understanding these systems poses significant challenges due to their scale and the unstructured nature of development artifacts such as commits and mailing list discussions. We introduce Code-Survey, the first LLM-driven methodology designed to systematically explore and analyze large-scale codebases. The central principle behind Code-Survey is to treat LLMs as human participants, acknowledging that software development is also a social activity and thereby enabling the application of established social science techniques. By carefully designing surveys, Code-Survey transforms unstructured data, such as commits and emails, into organized, structured, and analyzable datasets. This enables quantitative analysis of complex…
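To make the survey idea concrete, here is a minimal Python sketch of how closed-ended questions can turn free-text commit messages into structured rows that can be counted. It is an illustration under stated assumptions, not the authors' implementation: the question keys, answer options, and sample commit messages are all hypothetical, and the LLM respondent is replaced by a simple keyword-based stub so the example runs without any external service.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    key: str        # column name in the resulting dataset
    prompt: str     # text that would be shown to the respondent
    options: tuple  # closed set of allowed answers


# Hypothetical survey; these questions and options are illustrative only.
SURVEY = (
    Question("commit_type", "What kind of change is this commit?",
             ("bug_fix", "new_feature", "cleanup", "unsure")),
    Question("touches_security", "Does the message mention a security concern?",
             ("yes", "no")),
)


def stub_respondent(question, commit_message):
    """Stand-in for an LLM call: answers with simple keyword rules."""
    text = commit_message.lower()
    if question.key == "commit_type":
        if "fix" in text:
            return "bug_fix"
        if "add" in text or "introduce" in text:
            return "new_feature"
        return "unsure"
    if question.key == "touches_security":
        return "yes" if "overflow" in text or "cve" in text else "no"
    raise KeyError(question.key)


def run_survey(commit_message, respondent=stub_respondent):
    """Return one structured row; reject answers outside the allowed options."""
    row = {}
    for q in SURVEY:
        answer = respondent(q, commit_message)
        if answer not in q.options:
            raise ValueError(f"{q.key}: unexpected answer {answer!r}")
        row[q.key] = answer
    return row


# Made-up commit messages, not taken from any real repository.
COMMITS = [
    "bpf: fix buffer overflow in verifier",
    "bpf: add new helper for ring buffer",
    "docs: update README",
]

rows = [run_survey(msg) for msg in COMMITS]
type_counts = Counter(row["commit_type"] for row in rows)
print(rows)
print(type_counts)
```

Restricting each question to a fixed option set, and including an explicit "unsure" choice, is what makes the answers directly countable; swapping the stub for a real model would only require passing a different `respondent` callable with the same signature.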
Taxonomy
Topics: Natural Language Processing Techniques
