Theoretical and Methodological Framework for Studying Texts Produced by Large Language Models
Jiří Milička

TL;DR
This paper proposes a theoretical and methodological framework for analyzing texts generated by large language models from a quantitative linguistics perspective, emphasizing non-anthropomorphic approaches and the potential for studying human culture.
Contribution
It introduces a conceptual framework that distinguishes LLMs from the entities they simulate, advocates non-anthropomorphic analysis of the models, and expands the study of LLM-produced texts within linguistic theory.
Findings
Framework differentiates LLMs from the entities they simulate.
Highlights the importance of non-anthropomorphic analysis.
Suggests LLMs as tools for studying human culture.
Abstract
This paper addresses the conceptual, methodological and technical challenges in studying large language models (LLMs) and the texts they produce from a quantitative linguistics perspective. It builds on a theoretical framework that distinguishes between the LLM as a substrate and the entities the model simulates. The paper advocates for a strictly non-anthropomorphic approach to models while cautiously applying methodologies used in studying human linguistic behavior to the simulated entities. While natural language processing researchers focus on the models themselves, their architecture, evaluation, and methods for improving performance, we as quantitative linguists should strive to build a robust theory concerning the characteristics of texts produced by LLMs, how they differ from human-produced texts, and the properties of simulated entities. Additionally, we should explore the…
Taxonomy
Topics: Topic Modeling
Methods: Focus
