Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content

Simon Münker; Nils Schwager; Kai Kugler; Michael Heseltine; Achim Rettinger

arXiv:2602.19177 · cs.CL · February 24, 2026

TL;DR

This paper introduces a reply prediction dataset built from X (formerly Twitter) data to evaluate linguistic differences between LLM-generated and human-written content, emphasizing the need for better prompting to improve research validity.

Contribution

The paper presents a novel, history-conditioned reply prediction task and dataset for assessing how LLM outputs differ linguistically from human data in social science research.

Findings

1. LLMs exhibit significant stylistic and content discrepancies from human data.

2. Current naive prompting techniques often produce less authentic linguistic patterns.

3. Enhanced prompting and specialized datasets are necessary for valid social science applications.

Abstract

The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, to create a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the…
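The abstract mentions stylistic metrics without listing them here, so the following is only a minimal sketch of what such a comparison could look like. The function names (`style_profile`, `style_gap`) and the three chosen statistics (mean reply length, type-token ratio, hashtag rate) are illustrative assumptions, not the metrics used in the paper.

```python
import re
from statistics import mean

# Assumption: a crude tokenizer that keeps hashtags and mentions as single tokens.
TOKEN_RE = re.compile(r"[#@]?\w+")


def style_profile(replies):
    """Compute simple corpus-level style statistics for a non-empty list of replies."""
    tokens_per_reply = [TOKEN_RE.findall(r.lower()) for r in replies]
    all_tokens = [t for toks in tokens_per_reply for t in toks]
    n = len(all_tokens)
    return {
        # Average number of tokens per reply.
        "mean_length_tokens": mean(len(toks) for toks in tokens_per_reply),
        # Lexical diversity: distinct tokens divided by total tokens.
        "type_token_ratio": len(set(all_tokens)) / n if n else 0.0,
        # Share of tokens that are hashtags.
        "hashtag_rate": sum(t.startswith("#") for t in all_tokens) / n if n else 0.0,
    }


def style_gap(human, synthetic):
    """Return synthetic-minus-human differences for each statistic."""
    h, s = style_profile(human), style_profile(synthetic)
    return {key: s[key] - h[key] for key in h}


# Hypothetical toy data, not drawn from the dataset.
human = ["lol no", "so true #mood"]
synthetic = ["I completely agree with your point.", "That is a great observation!"]
gap = style_gap(human, synthetic)
```

A positive `mean_length_tokens` gap would indicate that the generated replies run longer than the human ones, which is one kind of discrepancy such a framework could surface. Real evaluations would need a proper tokenizer and far larger samples.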

Code & Models

Models

nsschw/NRPX-Qwen3-8B-v1-eng (Hugging Face model)

Taxonomy

Topics: Computational and Text Analysis Methods · Mental Health via Writing · Authorship Attribution and Profiling