Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts
Luxu Liang, Xiang Li

TL;DR
This paper introduces Steer-to-Detect, a two-stage framework that enhances detection of LLM-generated texts by steering hidden representations, with theoretical guarantees and strong empirical performance.
Contribution
The paper proposes a steering-based method for detecting LLM-generated text that improves the class separability of hidden representations and comes with theoretical error guarantees.
Findings
S2D achieves strong detection performance across various scenarios.
The method provides finite-sample, high-probability guarantees on Type I and Type II errors.
Empirical results show robustness in out-of-distribution settings and under adversarial attacks.
Abstract
The rapid advancement of large language models (LLMs) has made machine-generated text increasingly difficult to distinguish from human-written text. While recent studies explore leveraging internal representations of language models to uncover deeper detection signals, these raw features often exhibit substantial overlap between classes, limiting their discriminative power. To address this challenge, we propose Steer-to-Detect (S2D), a two-stage framework for detecting LLM-generated text. In the first stage, S2D learns a steering vector that is injected into the hidden states of a frozen observer LLM, producing representations with improved class separability. In the second stage, detection is performed via a hypothesis testing procedure based on the steered representations. We establish finite-sample, high-probability guarantees for Type I and Type II errors,…
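To make the two-stage structure concrete, the following is a minimal toy sketch, not the authors' implementation. Everything in it is an assumption for illustration: the hidden states are synthetic Gaussian vectors rather than activations of a real observer LLM, the frozen later layers are replaced by a fixed random tanh map, the steering vector is a simple mean difference rather than a learned one, and the test statistic and quantile-based threshold are generic choices that may differ from the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size (assumption)

# Synthetic stand-ins for hidden states of a frozen observer LLM (assumption).
human_h = rng.normal(0.0, 1.0, size=(1000, d))
llm_h = rng.normal(0.3, 1.0, size=(1000, d))

# Fixed random map standing in for the frozen layers after the injection point.
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))


def steered_representation(h, v, strength=1.0):
    """Add the steering vector v to the hidden states, then apply the frozen map."""
    return np.tanh((h + strength * v) @ W)


# Stage 1 (placeholder): a mean-difference vector instead of a learned one.
v = llm_h.mean(axis=0) - human_h.mean(axis=0)
human_rep = steered_representation(human_h, v)
llm_rep = steered_representation(llm_h, v)

# Stage 2: one-sided test on a scalar statistic of the steered representation.
# The direction is fitted in-sample here; a real setup would use held-out data.
direction = llm_rep.mean(axis=0) - human_rep.mean(axis=0)


def statistic(rep):
    """Project steered representations onto the separating direction."""
    return rep @ direction


def calibrate_threshold(null_scores, level=0.05):
    """Pick the (1 - level) empirical quantile of scores on human text."""
    return float(np.quantile(null_scores, 1.0 - level))


threshold = calibrate_threshold(statistic(human_rep), level=0.05)


def detect(rep):
    """Reject the 'human-written' null when the statistic exceeds the threshold."""
    return statistic(rep) > threshold


type1_rate = float(detect(human_rep).mean())  # false alarms on human text
power = float(detect(llm_rep).mean())         # detections on LLM text
print(f"Type I rate: {type1_rate:.3f}, power: {power:.3f}")
```

The threshold is calibrated only on human-written samples, so the empirical Type I rate on that calibration set is capped at the chosen level by construction; the finite-sample guarantees described in the abstract concern the paper's own procedure and are not reproduced by this sketch.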
