Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models
Sergio E. Zanotto, Segun Aroyehun

TL;DR
This study characterizes human-written and machine-generated texts using linguistic features at multiple levels, revealing differences in syntactic simplicity, semantic diversity, and stylistic variability, with more recent models producing more homogenized outputs.
Contribution
It introduces a comprehensive linguistic profiling approach to distinguish human and LLM texts, analyzing variability across models, domains, and sampling strategies.
Findings
Human texts have simpler syntax and more diverse semantics.
Machine texts show less stylistic variability, especially in newer models.
Both types exhibit domain-dependent stylistic diversity.
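To make "variability" concrete, here is a minimal sketch of one common dispersion measure: the mean pairwise cosine distance within a group of text vectors. This summary does not state how the study quantified variability, so the measure and the function names `cosine_distance` and `mean_pairwise_distance` are illustrative assumptions, and the toy vectors stand in for real feature or embedding vectors.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over all unordered pairs (hypothetical variability score)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two texts: no spread to measure
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy vectors only: a tightly clustered group and a widely spread group.
tight_group = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
spread_group = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(mean_pairwise_distance(tight_group) < mean_pairwise_distance(spread_group))
```

Under this measure, a more homogenized set of outputs would score lower than a more varied one.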
Abstract
Rapid advances in large language models (LLMs) have significantly improved their ability to generate natural language, making LLM-generated texts increasingly indistinguishable from human-written ones. While recent research has primarily focused on using LLMs to classify text as either human-written or machine-generated, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate linguistic features such as dependency length and emotionality, and we use them to characterize human-written and machine-generated texts across different sampling strategies, repetition controls, and model release dates. Our statistical analysis…
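As an illustration of one feature named in the abstract, the sketch below computes mean dependency length for a single parsed sentence. It assumes the parse is given as a list of 1-based head indices with 0 marking the root (the CoNLL-U convention); the function name `mean_dependency_length` and the hand-written example parse are hypothetical, and the paper's actual parser and feature definition may differ.

```python
from typing import List


def mean_dependency_length(heads: List[int]) -> float:
    """Average linear distance between each token and its syntactic head.

    heads[i] is the 1-based position of the head of token i + 1;
    a value of 0 marks the root, which has no incoming arc and is skipped.
    """
    distances = [
        abs(head - position)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    return sum(distances) / len(distances) if distances else 0.0


# Hand-annotated parse of "The cat sat on the mat" (an assumption, not parser output):
# The -> cat, cat -> sat, sat = root, on -> mat, the -> mat, mat -> sat
example_heads = [2, 3, 0, 6, 6, 3]
print(mean_dependency_length(example_heads))
```

Lower values indicate that words tend to attach to nearby heads, which is one way to operationalize simpler syntax.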
Taxonomy
Topics: Authorship Attribution and Profiling · Topic Modeling · Mental Health via Writing
