The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices
Esteban Garces Arias, Nurzhan Sapargali, Christian Heumann, Matthias Aßenmacher

TL;DR
This paper shows that likelihood-based decoding strategies for text generation exclude tokens that humans would choose, which makes machine-generated text easier to detect, and that decoding parameters affect detectability more than model size does.
Contribution
It identifies the truncation blind spot in decoding strategies and demonstrates its impact on text detectability, highlighting the importance of decoding choices over model scale.
Findings
8-18% of human-selected tokens fall outside typical truncation boundaries.
Simple classifiers can detect machine-generated text with high accuracy.
Decoding parameters influence detectability more than model size or architecture.
Abstract
Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood, restricting selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by likelihood-based decoding. We hypothesize this contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity achieve remarkable detection…
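The blind spot described in the abstract can be illustrated with a small sketch. It is not the authors' measurement code: the function names (`top_k_set`, `nucleus_set`, `fraction_outside`) and the toy distributions are hypothetical, and in practice the probabilities would come from a language model scored on human-written text. For each position, the sketch builds the set of tokens a truncation rule keeps and checks whether the token the human actually wrote is in it.

```python
def top_k_set(probs, k):
    """Return the indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def nucleus_set(probs, p):
    """Return the smallest prefix of the ranked tokens whose cumulative
    probability reaches p (top-p / nucleus truncation)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


def fraction_outside(steps, keep_fn):
    """Share of positions where the human-chosen token is not in the kept set.

    steps: list of (probs, human_token_index) pairs, one per text position.
    keep_fn: maps a probability list to the set of token indices kept.
    """
    outside = sum(1 for probs, token in steps if token not in keep_fn(probs))
    return outside / len(steps)


# Toy data (assumption): four positions over a four-token vocabulary.
toy_steps = [
    ([0.5, 0.3, 0.15, 0.05], 0),   # human picks the most likely token
    ([0.5, 0.3, 0.15, 0.05], 3),   # human picks a rare token
    ([0.4, 0.4, 0.1, 0.1], 1),     # human picks a likely token
    ([0.6, 0.25, 0.1, 0.05], 2),   # human picks a mid-ranked token
]

print(fraction_outside(toy_steps, lambda pr: top_k_set(pr, 2)))
print(fraction_outside(toy_steps, lambda pr: nucleus_set(pr, 0.9)))
```

On this toy data the two rules disagree on the last position: top-k with k=2 drops the mid-ranked token, while nucleus truncation with p=0.9 keeps it. That difference is a small-scale version of how the choice of decoding hyperparameters changes which human tokens are reachable.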
Taxonomy
Topics: Authorship Attribution and Profiling · Language and cultural evolution · Natural Language Processing Techniques
