Explainability-Based Token Replacement on LLM-Generated Text
Hadi Mohammadi, Anastasia Giachanou, Daniel L. Oberski, and Ayoub Bagheri

TL;DR
This paper explores how explainable AI techniques can be used to modify AI-generated text to evade detection, and proposes ensemble detection methods to counteract such manipulations.
Contribution
It introduces explainability-based token replacement strategies to reduce AI text detectability and demonstrates the effectiveness of ensemble classifiers in maintaining detection robustness.
Findings
Token replacement reduces a single classifier's ability to detect AI-generated text
Ensemble classifiers remain effective across languages and domains
Explainability methods can identify influential tokens for manipulation
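To make the contrast between a single classifier and an ensemble concrete, here is a minimal sketch of soft voting, in which several detectors' probabilities are averaged before thresholding. The combination rule, the threshold, and the probability values are all assumptions for illustration; the summary above does not say how the paper's ensemble combines its members, and the numbers are made up rather than taken from the paper.

```python
from statistics import mean
from typing import Sequence


def soft_vote(probabilities: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag text as AI-generated if the mean detector probability reaches the threshold.

    `probabilities` holds one P(AI-generated) value per ensemble member.
    Averaging (soft voting) is an assumed combination rule, not the paper's.
    """
    return mean(probabilities) >= threshold


# Illustrative, made-up probabilities from three hypothetical detectors.
before_attack = [0.90, 0.80, 0.85]
# After token replacement aimed at detector 0 only: its score collapses,
# while the other two detectors are barely affected.
after_attack = [0.20, 0.75, 0.80]

single_detector_flags = after_attack[0] >= 0.5   # detector 0 alone is evaded
ensemble_flags = soft_vote(after_attack)          # the averaged vote still flags
```

In this toy setting the attacked detector on its own no longer flags the text, while the averaged vote still does, which is the kind of robustness the findings attribute to ensembles.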
Abstract
Generative models, especially large language models (LLMs), have shown remarkable progress in producing text that appears human-like. However, they often exhibit patterns that make their output easier to detect than text written by humans. In this paper, we investigate how explainable AI (XAI) methods can be used to reduce the detectability of AI-generated text (AIGT) while also introducing a robust ensemble-based detection approach. We begin by training an ensemble classifier to distinguish AIGT from human-written text, then apply SHAP and LIME to identify tokens that most strongly influence its predictions. We propose four explainability-based token replacement strategies to modify these influential tokens. Our findings show that these token replacement approaches can significantly diminish a single classifier's ability to detect AIGT. However, our ensemble classifier maintains strong…
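The pipeline described in the abstract (score tokens by their influence on the detector, then replace the most influential ones) can be sketched as follows. This is a simplified stand-in, not the authors' method: it uses leave-one-out occlusion in place of SHAP or LIME, a hypothetical keyword-based `toy_detector` in place of a trained classifier, and a hand-written substitute table in place of the paper's four replacement strategies.

```python
from typing import Callable, Dict, List

# Hypothetical weights for "AI-typical" words; a real detector would be trained.
AI_MARKERS: Dict[str, float] = {"delve": 0.9, "furthermore": 0.6, "tapestry": 0.8}


def toy_detector(tokens: List[str]) -> float:
    """Return a pseudo-probability in [0, 1) that the tokens are AI-generated."""
    total = sum(AI_MARKERS.get(t.lower(), 0.0) for t in tokens)
    return total / (1.0 + total)


def occlusion_attributions(
    tokens: List[str], detector: Callable[[List[str]], float]
) -> List[float]:
    """Attribute to each token the drop in detector score when it is removed."""
    base = detector(tokens)
    return [base - detector(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def replace_top_tokens(
    tokens: List[str],
    detector: Callable[[List[str]], float],
    substitutes: Dict[str, str],
    k: int = 2,
) -> List[str]:
    """Replace up to k of the most influential tokens that have a substitute."""
    scores = occlusion_attributions(tokens, detector)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    out = list(tokens)
    replaced = 0
    for i in ranked:
        if replaced == k or scores[i] <= 0:
            break  # stop once k tokens are swapped or no influential tokens remain
        sub = substitutes.get(tokens[i].lower())
        if sub is not None:
            out[i] = sub
            replaced += 1
    return out


tokens = "We delve into the rich tapestry of results".split()
rewritten = replace_top_tokens(
    tokens, toy_detector, {"delve": "look", "tapestry": "mix"}, k=2
)
```

Swapping in SHAP or LIME would change only `occlusion_attributions`; the ranking and replacement steps would stay the same under this sketch's assumptions.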
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
