ExplainableGuard: Interpretable Adversarial Defense for Large Language Models Using Chain-of-Thought Reasoning

Shaowei Guan; Yu Zhai; Zhengyu Zhang; Yanze Wang; Hin Chi Kwok

arXiv:2511.13771 · cs.CR · November 19, 2025

Open Access

TL;DR

ExplainableGuard is a novel framework that uses chain-of-thought reasoning to detect, neutralize, and explain adversarial attacks on large language models, enhancing transparency and trustworthiness.

Contribution

The paper introduces an interpretable adversarial defense method that uses chain-of-thought reasoning to give a step-by-step explanation for each defense action, with the aim of improving trust in LLM security.

Findings

01 Detects and neutralizes adversarial perturbations in text.

02 Human evaluations favor ExplainableGuard's explanations.

03 Preliminary results on the GLUE Benchmark and IMDB Movie Reviews dataset show promising defense efficacy.

Abstract

Large Language Models (LLMs) are increasingly vulnerable to adversarial attacks that can subtly manipulate their outputs. While various defense mechanisms have been proposed, many operate as black boxes, lacking transparency in their decision-making. This paper introduces ExplainableGuard, an interpretable adversarial defense framework leveraging the chain-of-thought (CoT) reasoning capabilities of DeepSeek-Reasoner. Our approach not only detects and neutralizes adversarial perturbations in text but also provides step-by-step explanations for each defense action. We demonstrate how tailored CoT prompts guide the LLM to perform a multi-faceted analysis (character, word, structural, and semantic) and generate a purified output along with a human-readable justification. Preliminary results on the GLUE Benchmark and IMDB Movie Reviews dataset show promising defense efficacy. Additionally, a…
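The multi-faceted prompting described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the JSON response schema, and the names `ANALYSIS_LEVELS`, `build_cot_prompt`, `parse_defense_response`, and `DefenseResult` are all assumptions made for this example. The call to the reasoning model is deliberately left out, and a hand-written mock response stands in for it.

```python
import json
from dataclasses import dataclass

# The four analysis facets named in the abstract.
ANALYSIS_LEVELS = ("character", "word", "structural", "semantic")


@dataclass
class DefenseResult:
    """Hypothetical container for one defense decision."""
    is_adversarial: bool
    purified_text: str
    explanation: str


def build_cot_prompt(text: str) -> str:
    """Assemble a step-by-step defense prompt.

    The wording is illustrative only; the paper's actual prompts are not
    reproduced here.
    """
    steps = "\n".join(
        f"{i}. Inspect the input at the {level} level and note any "
        "suspicious perturbation."
        for i, level in enumerate(ANALYSIS_LEVELS, start=1)
    )
    final_step = len(ANALYSIS_LEVELS) + 1
    return (
        "You are a defense module. Reason step by step.\n"
        f"{steps}\n"
        f"{final_step}. Return JSON with keys "
        '"is_adversarial", "purified_text", "explanation".\n'
        f"Input: {text!r}"
    )


def parse_defense_response(raw: str) -> DefenseResult:
    """Parse the assumed JSON reply into a DefenseResult."""
    data = json.loads(raw)
    return DefenseResult(
        is_adversarial=bool(data["is_adversarial"]),
        purified_text=str(data["purified_text"]),
        explanation=str(data["explanation"]),
    )


if __name__ == "__main__":
    print(build_cot_prompt("Th1s m0vie was gr8t"))
    # Mock reply standing in for a real model call (an assumption).
    mock_reply = json.dumps({
        "is_adversarial": True,
        "purified_text": "This movie was great",
        "explanation": "Digit-for-letter substitutions at the character level.",
    })
    print(parse_defense_response(mock_reply))
```

Keeping the purified text and the explanation in one structured reply means a downstream classifier can consume `purified_text` while a human reviewer reads `explanation`, which mirrors the detect, neutralize, and explain roles the abstract describes.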

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications