Self-Disguise Attack: Induce the LLM to disguise itself for AIGT detection evasion

Yinghan Zhou; Juan Wen; Wanli Peng; Zhengxian Wu; Ziwei Zhang; Yiming Xue

arXiv:2508.15848·cs.CR·August 25, 2025

TL;DR

This paper introduces Self-Disguise Attack (SDA), a novel method enabling large language models to produce more human-like, detection-resistant text by actively disguising their output, thereby evading AI-generated text detectors efficiently.

Contribution

The paper proposes SDA, combining adversarial feature extraction and retrieval-based context optimization, to improve AIGT evasion with lower computational costs and preserved text quality.

Findings

1. SDA significantly reduces detection accuracy of AIGT detectors.
2. SDA maintains high quality of generated text.
3. SDA is effective across multiple LLMs and detection methods.

Abstract

AI-generated text (AIGT) detection evasion aims to reduce the detection probability of AIGT, helping to identify weaknesses in detectors and enhance their effectiveness and reliability in practical applications. Although existing evasion methods perform well, they suffer from high computational costs and text quality degradation. To address these challenges, we propose Self-Disguise Attack (SDA), a novel approach that enables Large Language Models (LLMs) to actively disguise their output, reducing the likelihood of detection by classifiers. SDA comprises two main components: the adversarial feature extractor and the retrieval-based context examples optimizer. The former generates disguise features that enable LLMs to understand how to produce more human-like text. The latter retrieves the most relevant examples from an external knowledge base as in-context examples, further enhancing…
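As a rough illustration of the retrieval component described in the abstract, the sketch below selects in-context examples from a small knowledge base by similarity to the current task and assembles them, together with a list of disguise features, into a prompt. Everything specific here is an assumption made for illustration: the bag-of-words cosine similarity stands in for whatever retriever the paper actually uses, and the names `retrieve_examples`, `build_prompt`, `KB`, and `FEATURES`, the prompt wording, and the toy data are hypothetical rather than taken from the paper.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query, knowledge_base, k=2):
    """Return the k entries whose task text is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        knowledge_base,
        key=lambda entry: cosine(q, bag_of_words(entry["task"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, disguise_features, examples):
    """Assemble a prompt: style notes, then in-context examples, then the task."""
    lines = ["Write in a way that follows these style notes:"]
    lines += [f"- {feature}" for feature in disguise_features]
    for i, ex in enumerate(examples, 1):
        lines.append(f"Example {i} task: {ex['task']}")
        lines.append(f"Example {i} output: {ex['output']}")
    lines.append(f"Task: {query}")
    return "\n".join(lines)


# Toy knowledge base and features, invented for illustration only.
KB = [
    {"task": "ocean tides and coastal erosion", "output": "<disguised text A>"},
    {"task": "city traffic congestion report", "output": "<disguised text B>"},
    {"task": "tides and lunar cycles", "output": "<disguised text C>"},
]
FEATURES = ["vary sentence length", "prefer concrete wording"]

query = "coastal tides"
top = retrieve_examples(query, KB, k=2)
prompt = build_prompt(query, FEATURES, top)
print(prompt)
```

With this toy data, the two tide-related entries are retrieved and the traffic entry is left out. Swapping in a dense sentence encoder would only change `bag_of_words` and `cosine`; the ranking and prompt assembly would stay the same.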

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 6·Confidence 4

Strengths

1. Using extracted features to guide generation is a good approach that makes intuitive sense for evading detection.
2. Using a large dataset to extract features makes this process automated and more reliable.
3. SDA shows better detection evasion results than existing AIGT evasion methods and displays better generation quality.

Weaknesses

1. Distinction from prior work: Lower computational cost is listed as one of the main benefits over existing methods, but no statistics are given to support the claim.
2. Lack of clarity: According to Fig. 1, the training data consists of detection-evaded text, but Section 4.1 describes the datasets as human-generated text instead.

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The paper is well written.
2. The idea is novel.

Weaknesses

1. SDA’s effectiveness hinges on the proxy detector used to extract disguise features, which may limit its universality. It is tuned against a specific surrogate detector (e.g. a RoBERTa-based ChatGPT detector). It generalizes to other detectors better than the baselines, but the paper lacks an explanation for this phenomenon.
2. SDA involves an iterative feature extraction process and the creation of an external knowledge base of disguised examples. While the experimental section provides some runtime measureme

Reviewer 03·Rating 2·Confidence 5

Strengths

1. The proposed method is fully implemented through prompt engineering and in-context learning, without requiring any fine-tuning of the model, which significantly reduces computational cost and implementation complexity.
2. The disguise features are expressed in natural language form, providing interpretability and helping to reveal the intrinsic differences between AIGT and HWT.
3. The authors conduct a comprehensive evaluation using multiple quantitative metrics, thoroughly assessing both the

Weaknesses

1. The proposed approach shows a strong dependency on prompt design and the chosen proxy detector, which raises concerns about its generalizability to broader domains or unseen detection models.
2. The experimental setup is limited, using only 1,000 samples extracted from the RAID dataset and focusing solely on summarization tasks; further validation on diverse tasks and datasets is necessary to demonstrate robustness.
3. The external knowledge base relies entirely on the disguised texts generat

Reviewer 04·Rating 2·Confidence 4

Strengths

1. A training-free evasion approach for AI-generated text detection that relies solely on prompt engineering to produce highly human-like, detector-evading outputs.
2. Better preservation of text quality relative to other baselines.
3. A relatively novel application of RAG.

Weaknesses

**This paper is not well written.** There are multiple citation errors and invalid figure references (see weaknesses below). It will require substantial revisions before publication. In its current form, it is not ready for a top-tier venue like ICLR. The main weaknesses, ranked by severity, are as follows:
1. **Exaggerated performance.** In this paper, you use the ChatGPT-detector as your proxy model, which is based on RoBERTa-base; all the detectors you attack in evaluation are also RoBERTa-b
