Beyond Perplexity: Character Distribution Signatures and the MDTA Benchmark for AI Text Detection

Priyadarshan Narayanasamy; Swastik Agrawal; Klint Faber; Fardina Fathmiul Alam

arXiv:2605.01647·cs.CL·May 5, 2026

TL;DR

This paper proposes character distribution signatures as a detection signal and introduces the MDTA benchmark for AI text detection, targeting cases where log-probability methods hit a ceiling because RLHF pushes models toward human-like probability distributions.

Contribution

It introduces a detection signal based on character distributions, gives a theoretical account of why human-AI divergence exceeds AI-AI divergence, and constructs the MDTA benchmark for systematic evaluation.

Findings

01. Letter Distribution Score (LD-Score) has low correlation with perplexity-based methods.

02. LD-Score improves detection performance when combined with existing methods.

03. The MDTA benchmark comprises 642,274 prompt-aligned samples across 4 models, 5 domains, 3 temperature settings, and several adversarial strategies.

Abstract

Training-free AI text detection methods primarily rely on model log-probabilities, achieving strong performance through approaches like Binoculars and DNA-DetectLLM. However, these methods face a fundamental ceiling as models are optimized through RLHF to produce human-like probability distributions. We introduce an alternative detection signal based on character distribution signatures. We provide theoretical foundations showing that AI models, trained on massive domain-balanced corpora, approximate global character patterns while humans exhibit domain-specialized distributions, creating a "Wall of Separation" where human-AI divergence significantly exceeds AI-AI divergence. To enable systematic evaluation, we construct the Models-Domains-Temperatures-Adversarials (MDTA) benchmark comprising 642,274 prompt-aligned samples across 4 models, 5 domains, 3 temperature settings, and 3…
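The excerpt above does not give the LD-Score formula, so the following is only a minimal sketch of the general idea of comparing character distributions, not the authors' method. It assumes lowercase ASCII letters as the alphabet and Jensen-Shannon divergence as the distance; the function names `letter_distribution` and `js_divergence` are hypothetical.

```python
import math
import string
from collections import Counter

# Assumption: only the 26 lowercase ASCII letters are counted.
ALPHABET = string.ascii_lowercase


def letter_distribution(text):
    """Return the relative frequency of each letter a-z in `text`."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        # No letters at all: fall back to a uniform distribution.
        return [1.0 / len(ALPHABET)] * len(ALPHABET)
    return [counts[ch] / total for ch in ALPHABET]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    The result lies in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    # Illustrative strings only; these are not samples from MDTA.
    sample_a = "The quick brown fox jumps over the lazy dog."
    sample_b = "Zzz zzz buzz fizz jazz pizzazz."
    divergence = js_divergence(
        letter_distribution(sample_a), letter_distribution(sample_b)
    )
    print(f"JS divergence: {divergence:.3f}")
```

Under this sketch, the "Wall of Separation" claim would correspond to divergences between human-written and AI-generated texts being noticeably larger than divergences between texts from two different AI models. Whether that holds under this particular distance is something to check against the benchmark rather than assume.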

Code & Models

Repositories

https://huggingface.co/datasets/nsp909/MDTA

Datasets

nsp909/MDTA
dataset · 186 downloads
