Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models
Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Ye Wang, Gang Tan, Shagufta Mehnaz

TL;DR
This paper presents ASE, a novel inference-time framework using Chain-of-Thought reasoning to improve large language models' robustness against adversarial attacks while maintaining user experience.
Contribution
It introduces Adversarial Scenario Extrapolation (ASE), a new method that enhances LLM robustness and seamlessness through self-generative adversarial scenario contemplation.
Findings
Near-zero jailbreak success rates achieved.
Significant reduction in toxicity and bias scores.
Outperforms existing defenses in robustness and seamlessness.
Abstract
Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four latest LLMs shows that ASE achieves near-zero jailbreak…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
