Can Humans Tell? A Dual-Axis Study of Human Perception of LLM-Generated News
Alexander Loth, Martin Kappes, Marc-Oliver Pahl

TL;DR
Humans cannot reliably distinguish human-written from LLM-generated news articles, regardless of model size, although self-reported domain expertise predicts higher judgment accuracy. The result highlights the difficulty of detection and the need for system-level solutions.
Contribution
This study introduces JudgeGPT, a platform for measuring how well humans attribute the source of a news article, and uses it to assess detection difficulty across multiple LLMs and user profiles.
Findings
Participants cannot reliably distinguish machine from human text (p > .05).
Detection difficulty persists across all tested LLMs, including open-weight models with as few as 7B parameters.
Self-reported domain expertise predicts judgment accuracy (r = .35, p < .001), whereas political orientation does not (r = -.10, n.s.).
Abstract
Can humans tell whether a news article was written by a person or a large language model (LLM)? We investigate this question using JudgeGPT, a study platform that independently measures source attribution (human vs. machine) and authenticity judgment (legitimate vs. fake) on continuous scales. From 2,318 judgments collected from 1,054 participants across content generated by six LLMs, we report five findings: (1) participants cannot reliably distinguish machine-generated from human-written text (p > .05, Welch's t-test); (2) this inability holds across all tested models, including open-weight models with as few as 7B parameters; (3) self-reported domain expertise predicts judgment accuracy (r = .35, p < .001) whereas political orientation does not (r = -.10, n.s.); (4) clustering reveals distinct response strategies ("Skeptics" vs. "Believers"); and (5) accuracy degrades after…
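For readers unfamiliar with the statistics named in the abstract, the following is a minimal sketch of the two quantities behind findings (1) and (3): the Welch's t statistic with its Welch-Satterthwaite degrees of freedom, and the Pearson correlation coefficient r. It is not the authors' analysis code. The function names `welch_t` and `pearson_r` and all numbers in the toy data are assumptions made up for illustration; none of them come from the study.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical "how machine-like is this text?" ratings on a 0-1 scale.
    ratings_human_written = [0.42, 0.55, 0.61, 0.38, 0.50, 0.47]
    ratings_llm_generated = [0.45, 0.52, 0.58, 0.41, 0.53, 0.49]
    t, df = welch_t(ratings_human_written, ratings_llm_generated)
    print(f"Welch t = {t:.3f}, df = {df:.1f}")

    # Hypothetical self-reported expertise (1-5) vs. per-participant accuracy.
    expertise = [1, 2, 2, 3, 4, 5]
    accuracy = [0.48, 0.50, 0.47, 0.55, 0.53, 0.60]
    print(f"Pearson r = {pearson_r(expertise, accuracy):.3f}")
```

A t statistic close to zero relative to its degrees of freedom corresponds to a non-significant difference between the two groups of ratings, which is the pattern the abstract reports for source attribution. Converting t and df into a p-value requires the t-distribution CDF, which this standard-library sketch deliberately leaves out.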
