Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions
Rachneet Sachdeva, Rima Hazra, Iryna Gurevych

TL;DR
This paper introduces POATE, a jailbreak technique that uses contrastive reasoning to probe language model defenses, and proposes two chain-of-thought defenses, Intent-Aware CoT and Reverse Thinking CoT, to improve model robustness against such attacks.
Contribution
The paper presents POATE, a new contrastive reasoning attack method, and introduces Intent-Aware CoT and Reverse Thinking CoT to defend against these sophisticated jailbreaks.
Findings
POATE achieves ~44% attack success rate across six model families.
Proposed defenses significantly reduce vulnerability to contrastive reasoning attacks.
Enhanced reasoning methods improve model robustness against subtle adversarial prompts.
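Attack success rate (ASR) is conventionally the fraction of adversarial prompts whose responses are judged harmful. The following is a minimal sketch of that metric, assuming per-prompt boolean judgments are already available. The function names and sample labels are hypothetical, and this summary does not state how the paper judges harmfulness or aggregates across model families.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the fraction of attack prompts whose responses were judged harmful."""
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


def mean_rate(rates: Sequence[float]) -> float:
    """Unweighted mean of per-model rates (an assumed aggregation, not the paper's)."""
    if not rates:
        raise ValueError("need at least one rate")
    return sum(rates) / len(rates)


# Hypothetical judgments for two models; these are NOT results from the paper.
model_a = [True, False, True, False]    # 2 of 4 attacks judged successful
model_b = [False, False, True, False]   # 1 of 4 attacks judged successful

per_model = [attack_success_rate(model_a), attack_success_rate(model_b)]
overall = mean_rate(per_model)
```

A defense is effective under this metric to the extent that it lowers the rate on the same set of attack prompts.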
Abstract
Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities. Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities. In this work, we introduce POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. We conduct extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Pragmatism in Philosophy and Education
Methods: Attention Is All You Need · Absolute Position Encodings · Softmax · Linear Layer · Adam · Residual Connection · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing
