Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?

Chengda Lu; Xiaoyu Fan; Yu Huang; Rongwu Xu; Jijie Li; Wei Xu

arXiv:2505.17650·cs.AI·May 26, 2025

TL;DR

This paper investigates whether Chain-of-Thought (CoT) reasoning genuinely reduces the harmfulness of jailbreak attacks on reasoning models. Its theoretical analysis shows that CoT reasoning has dual effects on jailbreaking harmfulness, and it proposes a new jailbreak method, FicDetail, to test that analysis in practice.

Contribution

It provides a theoretical analysis of CoT reasoning's dual effects on jailbreaking harmfulness and introduces a novel jailbreak method, FicDetail, to empirically validate these effects.

Findings

01. CoT reasoning has dual effects on jailbreaking harmfulness.

02. The practical performance of FicDetail, the proposed jailbreak method, validates the theoretical findings.

03. Theoretical analysis clarifies the mechanisms behind CoT's impact on security.

Abstract

Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns. In this paper, we try to answer the question: Does CoT reasoning really reduce harmfulness from jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Based on the theoretical insights, we propose a novel jailbreak method, FicDetail, whose practical performance validates our theoretical findings.
