Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery
Jia Yu, Weiwei Yu, Pengfei Xiao, Fukun Xing

TL;DR
This paper introduces an agent-driven framework in which large language models autonomously conduct corpus linguistics research: they generate hypotheses, query data, and interpret results with minimal human intervention.
Contribution
It presents an approach that integrates LLMs with corpus query engines, enabling automated linguistic discovery while keeping the evidence verifiable and the results interpretable.
Findings
The agent identified diachronic semantic chains in the corpus data.
The framework demonstrated high quantitative agreement with published studies.
Corpus grounding improves the falsifiability and empirical validity of the model's claims.
Abstract
Corpus linguistics has traditionally relied on human researchers to formulate hypotheses, construct queries, and interpret results, a process demanding specialized technical skills and considerable time. We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the corpus, interpreting results, and refining analysis across multiple rounds. The human researcher sets direction and evaluates the final output. Unlike unconstrained LLM generation, every finding is anchored in verifiable corpus evidence. We treat this not as a replacement for the corpus-based/corpus-driven distinction but as a complementary dimension: it concerns who conducts the inquiry, not the epistemological relationship between theory and data. We…
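The investigative cycle described in the abstract (hypothesize, query, record evidence, refine) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the paper's implementation: the toy `CORPUS`, the `count_cooccurrence` tool, the `Finding` record, and the `scripted_policy` function that stands in for the LLM are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Toy dated corpus. Illustrative data only, not taken from the paper.
CORPUS = [
    (1850, "the broadcast of seed across the field"),
    (1900, "farmers broadcast seed by hand"),
    (1950, "the radio broadcast reached every home"),
    (2000, "the broadcast aired on television"),
]

def count_cooccurrence(word: str, collocate: str, start: int, end: int) -> int:
    """Hypothetical query tool: count sentences dated in [start, end)
    that contain both `word` and `collocate` as whitespace tokens."""
    return sum(
        1
        for year, text in CORPUS
        if start <= year < end
        and word in text.split()
        and collocate in text.split()
    )

@dataclass
class Finding:
    hypothesis: str   # claim the agent wanted to test
    query: dict       # exact tool arguments, kept so the result can be re-run
    result: int       # what the corpus returned

Step = Optional[Tuple[str, dict]]

def run_agent(propose: Callable[[List[Finding]], Step], max_rounds: int = 3) -> List[Finding]:
    """Run the loop: the policy proposes a hypothesis and query, the tool
    answers from the corpus, and the evidence is stored for the next round."""
    findings: List[Finding] = []
    for _ in range(max_rounds):
        step = propose(findings)
        if step is None:  # the policy has nothing further to test
            break
        hypothesis, query = step
        findings.append(Finding(hypothesis, query, count_cooccurrence(**query)))
    return findings

def scripted_policy(findings: List[Finding]) -> Step:
    """Stand-in for the LLM: a fixed plan indexed by the number of rounds done."""
    plan = [
        ("'broadcast' occurs with 'seed' before 1950",
         dict(word="broadcast", collocate="seed", start=1800, end=1950)),
        ("'broadcast' occurs with 'radio' from 1950 on",
         dict(word="broadcast", collocate="radio", start=1950, end=2050)),
    ]
    return plan[len(findings)] if len(findings) < len(plan) else None

findings = run_agent(scripted_policy)
for f in findings:
    print(f.hypothesis, "->", f.result)
```

Because each `Finding` stores the exact query alongside its result, a human reviewer could re-run any step against the corpus, which is one plausible way to keep the agent's claims checkable. A real system would replace `scripted_policy` with a model call and `count_cooccurrence` with a proper corpus query engine.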
