Beyond Perplexity: Character Distribution Signatures and the MDTA Benchmark for AI Text Detection
Priyadarshan Narayanasamy, Swastik Agrawal, Klint Faber, Fardina Fathmiul Alam

TL;DR
This paper proposes character distribution signatures and the MDTA benchmark to improve AI text detection, especially where log-probability methods hit a ceiling because RLHF pushes models toward human-like probability distributions.
Contribution
The paper introduces a detection signal based on character distribution patterns, gives a theoretical account of why human and AI text diverge, and constructs the MDTA benchmark dataset for systematic evaluation.
Findings
Letter Distribution Score (LD-Score) has low correlation with perplexity-based methods.
LD-Score improves detection performance when combined with existing methods.
The MDTA benchmark includes over 640,000 samples across multiple models, domains, and adversarial strategies.
Abstract
Training-free AI text detection methods primarily rely on model log-probabilities, achieving strong performance through approaches like Binoculars and DNA-DetectLLM. However, these methods face a fundamental ceiling as models are optimized through RLHF to produce human-like probability distributions. We introduce an alternative detection signal based on character distribution signatures. We provide theoretical foundations showing that AI models, trained on massive domain-balanced corpora, approximate global character patterns while humans exhibit domain-specialized distributions, creating a "Wall of Separation" where human-AI divergence significantly exceeds AI-AI divergence. To enable systematic evaluation, we construct the Models-Domains-Temperatures-Adversarials (MDTA) benchmark comprising 642,274 prompt-aligned samples across 4 models, 5 domains, 3 temperature settings, and 3…
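The abstract does not give the exact definition of the LD-Score, so the following is only an illustrative sketch of the general idea of comparing character distributions. It assumes lowercase ASCII letter frequencies and uses Jensen-Shannon divergence as a stand-in divergence measure; the function names `letter_distribution` and `js_divergence` are hypothetical and not taken from the paper.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase


def letter_distribution(text):
    """Relative frequency of each letter a-z in `text` (case-insensitive).

    Assumption: only ASCII letters are counted; the paper may use a
    different character set or normalization.
    """
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        # No letters at all: fall back to a uniform distribution.
        return [1.0 / len(ALPHABET)] * len(ALPHABET)
    return [counts[ch] / total for ch in ALPHABET]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint).

    Used here as a stand-in; the paper's actual divergence may differ.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    sample_a = "The quick brown fox jumps over the lazy dog."
    sample_b = "Pack my box with five dozen liquor jugs."
    d = js_divergence(letter_distribution(sample_a),
                      letter_distribution(sample_b))
    print(f"JS divergence between samples: {d:.4f}")
```

Under this sketch, the "Wall of Separation" claim would correspond to divergences between human and AI samples being consistently larger than divergences between samples from two AI models on the same prompts. Whether that holds depends on the paper's actual scoring and data, which this example does not reproduce.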
