Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)

Rafael Rivera Soto; Barry Chen; Nicholas Andrews

arXiv:2505.14608·cs.CL·September 30, 2025

Open Access 3 Reviews

TL;DR

This paper investigates the robustness of stylistic features in detecting machine-generated text, showing that models optimized to evade detectors still exhibit detectable stylistic traits, especially with multiple samples.

Contribution

It introduces a stylistic feature space that remains robust against adversarial optimization and proposes a paraphrasing attack to test detector resilience.

Findings

01. Stylistic features can reliably detect optimized machine text.

02. Detection remains effective with multiple samples, despite adversarial attacks.

03. Single-sample detection is vulnerable to paraphrasing attacks.
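
The contrast between single-sample and multi-sample detection can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's detector: per-sample detector scores are modeled as synthetic Gaussian noise around an author-level mean, and the names `simulate`, `auroc`, and the `gap` parameter are hypothetical. It shows the general statistical reason pooling helps: averaging several noisy scores from the same author shrinks the noise while the gap between machine and human authors stays the same.

```python
import random
from statistics import mean


def auroc(pos, neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def simulate(samples_per_author, n_authors=500, gap=0.5, seed=0):
    """AUROC when each author is scored by the mean of several noisy per-sample scores.

    Assumption: per-sample scores are Gaussian with unit variance, centered at
    `gap` for machine authors and at 0 for human authors. Both the distribution
    and the gap are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)

    def author_score(mu):
        return mean(rng.gauss(mu, 1.0) for _ in range(samples_per_author))

    machine = [author_score(gap) for _ in range(n_authors)]
    human = [author_score(0.0) for _ in range(n_authors)]
    return auroc(machine, human)


single = simulate(1)    # one text per author
pooled = simulate(10)   # ten texts per author, scores averaged
print(f"AUROC with 1 sample per author:   {single:.3f}")
print(f"AUROC with 10 samples per author: {pooled:.3f}")
```

Under these assumptions the pooled AUROC should come out clearly higher than the single-sample one, which is consistent with the pattern in findings 02 and 03: an attack that weakens the per-sample signal may still leave enough signal to recover once several samples from the same source are aggregated.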

Abstract

Despite considerable progress in the development of machine-text detectors, it has been suggested that the problem is inherently hard, and therefore, that stakeholders should proceed under the assumption that machine-generated text cannot be reliably detected as such. We examine a recent such claim by Nicks et al. (2024) regarding the ease with which language models can be optimized to degrade the performance of machine-text detectors, including detectors not specifically optimized against. We identify a feature space -- the stylistic feature space -- that is robust to such optimization, and show that it may be used to reliably detect samples from language models optimized to prevent detection. Furthermore, we show that even when models are explicitly optimized against stylistic detectors, detection performance remains surprisingly unaffected. We then seek to understand if stylistic…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The paper provides strong empirical evidence that style-based detection methods remain robust even after preference tuning aimed at fooling detectors.
- The style transfer experiments are valuable and demonstrate practical potential for improving the naturalness and “humanization” of LLM-generated text, particularly for applications like chatbots.
- Overall, the work offers a well-executed empirical study that contributes both to improving the human-likeness of generated text and to understa

Weaknesses

The weaknesses are minor and mostly concern the text.
- The introduction to style-based detectors could go deeper, for example by explaining what the style feature space is.
- The paper could benefit from studying larger LLMs, say up to 32B, without fine-tuning, to give a perspective on the limits of style-based detection.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The authors' experiments demonstrate that their approach outperforms other paraphrasing attacks at avoiding automatic detection.
- The experiments cover the rather underexplored setup of detecting AI-generated content by analyzing multiple text samples from the same author.
- Limitations of the proposed method are well outlined.

Weaknesses

- It is well known that existing methods for detecting AI-generated text are very brittle to paraphrasing; this paper does not provide any novelty in this regard. Essentially, the main contribution lies in applying existing methods to transfer the style of a human author to machine-generated texts.
- Main claims of the paper contradict each other: "although LLMs can be optimized to defeat machine-text detectors, they remain identifiable by detectors that avail of writing style and that moreo

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The proposed humaniser (as described in section 3.1) is an interesting, simple idea. The observation that methods to evade log-likelihood-based detectors do not evade style-based detectors is not surprising, but it is good to see it rigorously shown (one would be very surprised if style-based detectors were secretly exploiting artefacts which are visible to log-likelihood-based detectors, but it is still good that the authors have ruled this out).

Weaknesses

There are several minor weaknesses which I mention in the questions section below. Throughout the manuscript the authors use 'detectors' as a proxy for 'log-likelihood-based detectors of machine-generated text'. There are a lot of clever detectors incorporating signals other than pure log-likelihood (e.g. looking at patterns in log-likelihood rather than average log-likelihood). I don't think you need to use all of these different detectors as baselines, but I do think you should avoid writi

Taxonomy

Topics: Computational and Text Analysis Methods