Code-Survey: An LLM-Driven Methodology for Analyzing Large-Scale Codebases

Yusheng Zheng; Yiwei Yang; Haoqin Tu; Yuxi Huang

arXiv:2410.01837·cs.SE·October 4, 2024

Open Access

TL;DR

Code-Survey is an LLM-driven methodology for analyzing large-scale codebases. It uses carefully designed surveys to turn unstructured development artifacts, such as commits and emails, into structured datasets that support quantitative analysis of software evolution, design, and security.

Contribution

It introduces an LLM-based survey approach that treats LLMs as human survey participants and software development as a social activity, so that established social science techniques can be applied to complex, evolving software systems like the Linux kernel.

Findings

01 Uncovered development patterns and feature interdependencies in Linux eBPF

02 Validated insights with domain experts

03 Demonstrated applicability to other large-scale projects

Abstract

Modern software systems like the Linux kernel are among the world's largest and most intricate codebases, continually evolving with new features and increasing complexity. Understanding these systems poses significant challenges because of their scale and the unstructured nature of development artifacts such as commits and mailing list discussions. We introduce Code-Survey, the first LLM-driven methodology designed to systematically explore and analyze large-scale codebases. The central principle behind Code-Survey is to treat LLMs as human participants, acknowledging that software development is also a social activity and thereby enabling the application of established social science techniques. By carefully designing surveys, Code-Survey transforms unstructured data, such as commits and emails, into organized, structured, and analyzable datasets. This enables quantitative analysis of complex…
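The survey idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the question wording, the answer choices, the sample commit messages, and every function name below are hypothetical, and `placeholder_respondent` is a keyword-based stand-in for a real LLM call so that the example runs offline. It shows only the general shape: a closed-ended question is posed once per commit, each answer is validated against the allowed choices, and the results become rows that can be counted.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SurveyQuestion:
    """A closed-ended survey question (hypothetical schema, not from the paper)."""
    key: str
    prompt: str
    choices: tuple[str, ...]


# Hypothetical question; "not_sure" lets the respondent decline to guess.
COMMIT_TYPE = SurveyQuestion(
    key="commit_type",
    prompt="What is the main purpose of this commit?",
    choices=("bug_fix", "new_feature", "cleanup", "not_sure"),
)

# Made-up commit messages used only for illustration.
SAMPLE_COMMITS = [
    ("c1", "bpf: fix verifier bounds check"),
    ("c2", "bpf: add new map type"),
    ("c3", "docs: reword comment"),
]


def build_prompt(question: SurveyQuestion, commit_message: str) -> str:
    """Render the question, its allowed answers, and the commit message."""
    options = "\n".join(f"- {choice}" for choice in question.choices)
    return (
        f"{question.prompt}\nChoose exactly one:\n{options}\n\n"
        f"Commit message:\n{commit_message}"
    )


def placeholder_respondent(prompt: str) -> str:
    """Stand-in for an LLM call: a keyword heuristic over the commit message."""
    message = prompt.split("Commit message:\n", 1)[1].lower()
    if "fix" in message:
        return "bug_fix"
    if "add" in message:
        return "new_feature"
    if "cleanup" in message:
        return "cleanup"
    return "not_sure"


def run_survey(
    commits: list[tuple[str, str]],
    question: SurveyQuestion,
    respondent: Callable[[str], str],
) -> list[dict[str, str]]:
    """Ask the question once per commit and return one structured row each."""
    rows = []
    for commit_id, message in commits:
        answer = respondent(build_prompt(question, message)).strip()
        if answer not in question.choices:
            answer = "not_sure"  # discard free-form answers outside the schema
        rows.append({"commit": commit_id, question.key: answer})
    return rows


def tally(rows: list[dict[str, str]], key: str) -> Counter:
    """Count how often each answer appears for the given question key."""
    return Counter(row[key] for row in rows)
```

Calling `run_survey(SAMPLE_COMMITS, COMMIT_TYPE, placeholder_respondent)` yields one dictionary per commit, and `tally(rows, "commit_type")` aggregates them. Restricting answers to a fixed set of choices is what makes the output countable; in a real setting the respondent would be a model call, and the answers would still need the kind of expert validation the page mentions.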

Taxonomy

Topics: Natural Language Processing Techniques