Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions

Rachneet Sachdeva; Rima Hazra; Iryna Gurevych

arXiv:2501.01872 · cs.CL · October 1, 2025

TL;DR

This paper introduces a novel contrastive reasoning-based jailbreak technique called POATE that effectively probes language model defenses, and proposes new reasoning methods to improve model robustness against such attacks.

Contribution

The paper presents POATE, a new contrastive reasoning attack method, and introduces Intent-Aware CoT and Reverse Thinking CoT to defend against these sophisticated jailbreaks.

Findings

01 POATE achieves ~44% attack success rate across six model families.

02 Proposed defenses significantly reduce vulnerability to contrastive reasoning attacks.

03 Enhanced reasoning methods improve model robustness against subtle adversarial prompts.

Abstract

Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities. Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities. In this work, we introduce POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. We conduct extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to…

Code & Models

Repositories

ukplab/poate-attack (PyTorch, official)

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Pragmatism in Philosophy and Education

Methods: Attention Is All You Need · Absolute Position Encodings · Softmax · Linear Layer · Adam · Residual Connection · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing