Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction
Alexander Smirnov

TL;DR
This paper argues that fine-tuned AI text detectors mostly amplify a typicality axis already present in the pretrained encoder, rather than constructing a new boundary between AI and human text, which challenges common assumptions about their core mechanism.
Contribution
It shows that projections from raw encoders, taken before any task supervision, reach 86-106% of the fine-tuned discrimination ceiling (exceeding fine-tuning on RoBERTa-base), and it introduces a closed-form Jacobian predictor that parameterises axis-manipulating interventions.
Findings
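Raw encoder projections achieve NYT-vs-HC3 AUROC of 0.806/0.944/0.834 across three architectures, which is 86-106% of the fine-tuned ceiling; on RoBERTa-base the raw projection exceeds fine-tuning.
A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895).
An intervention parameterised by the Jacobian predictor lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%.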
Abstract
AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20) -- a falsifiable prediction unique to the typicality reading. A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895). A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained…
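The centroid-difference projection described in the abstract can be illustrated with a minimal sketch. It is not the author's code: the embeddings here are synthetic Gaussian vectors standing in for pooled encoder outputs, and the names `centroid_axis`, `auroc`, and `tpr_at_fpr` are hypothetical helpers introduced only for illustration. The abstract does not specify the pooling or evaluation pipeline, so those parts are assumptions.

```python
import numpy as np


def centroid_axis(emb_ai, emb_human):
    """Unit vector pointing from the human centroid to the AI centroid."""
    direction = emb_ai.mean(axis=0) - emb_human.mean(axis=0)
    return direction / np.linalg.norm(direction)


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())


def tpr_at_fpr(pos_scores, neg_scores, fpr=0.01):
    """TPR at a threshold that at most a fraction `fpr` of negatives exceed."""
    # method="higher" (NumPy >= 1.22) picks an actual negative score, not an interpolation.
    threshold = np.quantile(neg_scores, 1.0 - fpr, method="higher")
    return float((np.asarray(pos_scores) > threshold).mean())


# Synthetic stand-ins for encoder embeddings (assumption: real inputs would be
# pooled hidden states from a frozen pretrained encoder).
rng = np.random.default_rng(0)
dim = 16
shift = np.full(dim, 0.5)  # hypothetical offset between the two populations
emb_human = rng.normal(size=(400, dim))
emb_ai = rng.normal(size=(400, dim)) + shift

# Fit the axis on one half, score the held-out half.
axis = centroid_axis(emb_ai[:200], emb_human[:200])
scores_ai = emb_ai[200:] @ axis
scores_human = emb_human[200:] @ axis

held_out_auroc = auroc(scores_ai, scores_human)
held_out_tpr = tpr_at_fpr(scores_ai, scores_human, fpr=0.01)
print(f"AUROC: {held_out_auroc:.3f}")
print(f"TPR at FPR = 1%: {held_out_tpr:.3f}")
```

An AUROC below 0.5 under this definition means the axis ranks the two populations in the opposite order, which is how the inversion reported for non-native ESL writing (AUROC 0.06-0.20) should be read.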
