Human Variability vs. Machine Consistency: A Linguistic Analysis of Texts Generated by Humans and Large Language Models
Sergio E. Zanotto, Segun Aroyehun

TL;DR
This study profiles the linguistic features of human-written and LLM-generated texts. It finds that human texts are more variable, semantically richer, and emotionally more complex, especially in less constrained genres, pointing to cognitive and stylistic differences between the two.
Contribution
The paper introduces a novel linguistic profiling approach using 250 features and PCA to distinguish human and machine texts, emphasizing variability and content differences.
Findings
Humans show higher variability in linguistic features.
Human texts contain more semantic and emotional content.
Differences are more pronounced in less constrained genres.
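The variability finding can be illustrated with a minimal sketch: compute the spread of each feature separately for human and machine documents, then compare. The feature vectors below are made-up toy values, not data from the paper, and `feature_spread` is a hypothetical helper name; the paper's own statistical procedure is not described in this summary.

```python
from statistics import pstdev


def feature_spread(rows):
    """Return the population standard deviation of each feature column."""
    return [pstdev(column) for column in zip(*rows)]


# Toy feature vectors, one row per document (illustrative values only).
# Columns could stand for, e.g., sentence length and an emotion score.
human_docs = [[12.0, 0.40], [30.0, 0.90], [8.0, 0.10], [22.0, 0.70]]
machine_docs = [[18.0, 0.50], [19.0, 0.55], [17.0, 0.45], [18.0, 0.50]]

human_spread = feature_spread(human_docs)
machine_spread = feature_spread(machine_docs)

# True where the human group varies more than the machine group.
more_variable = [h > m for h, m in zip(human_spread, machine_spread)]
print(more_variable)
```

On real data, the same comparison would run over all extracted features and would normally be paired with a significance test rather than a raw comparison.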
Abstract
The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. Recent research has predominantly focused on using LLMs to classify text as either human-written or machine-generated. In our study, we adopt a different approach by profiling texts spanning four domains based on 250 distinct linguistic features. We select the M4 dataset from Subtask B of SemEval 2024 Task 8. We automatically calculate various linguistic features with the LFTK tool and additionally measure the average syntactic depth, semantic similarity, and emotional content for each document. We then apply a two-dimensional PCA reduction to all the calculated features. Our analyses reveal significant differences between human-written texts and those generated by LLMs,…
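The two-dimensional PCA step can be sketched as follows. This is not the authors' implementation: it is a pure-Python stand-in that uses power iteration with deflation, it assumes features are z-scored first (the abstract does not say whether they are), and it runs on a made-up 6-document, 3-feature matrix rather than the 250 LFTK features. All function names are hypothetical.

```python
import math
import random


def standardize(rows):
    """Z-score each feature column; constant columns are left at zero."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=500, seed=0):
    """Power iteration: dominant eigenvalue and unit eigenvector."""
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.random() + 0.1 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, mv)), v


def pca_2d(rows):
    """Project documents onto their first two principal components."""
    z = standardize(rows)
    cov = covariance(z)
    d = len(cov)
    components = []
    for k in range(2):
        value, vec = top_eigenvector(cov, seed=k)
        components.append(vec)
        # Deflate: remove the component just found before the next pass.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in z]


# Toy matrix: one row per document, one column per feature (made-up values).
docs = [
    [10.0, 2.0, 0.30],
    [12.0, 2.5, 0.35],
    [30.0, 6.0, 0.10],
    [28.0, 5.5, 0.80],
    [11.0, 2.2, 0.60],
    [29.0, 6.1, 0.20],
]
projected = pca_2d(docs)
```

Each row of `projected` is a document's position in the two-dimensional space, which is what one would plot to compare human and machine texts. In practice a library routine such as an SVD-based PCA would replace the hand-written eigenvector search.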
Taxonomy
Topics: Natural Language Processing Techniques
Methods: ADaptive gradient method with the OPTimal convergence rate · Principal Components Analysis
