Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?
Chengda Lu, Xiaoyu Fan, Yu Huang, Rongwu Xu, Jijie Li, Wei Xu

TL;DR
This paper investigates whether Chain-of-Thought (CoT) reasoning genuinely reduces the harmfulness of jailbreak attacks on reasoning models. It finds that CoT has dual effects on harmfulness and proposes a new jailbreak method, FicDetail, to test this finding in practice.
Contribution
The paper provides a theoretical analysis of CoT reasoning's dual effects on jailbreaking harmfulness and introduces a novel jailbreak method, FicDetail, to validate these effects empirically.
Findings
CoT reasoning has dual effects on jailbreaking harmfulness.
The practical performance of FicDetail, the proposed jailbreak method, validates the theoretical findings.
Theoretical analysis clarifies the mechanisms behind CoT's impact on security.
Abstract
Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns. In this paper, we try to answer the question: Does CoT reasoning really reduce harmfulness from jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Based on the theoretical insights, we propose a novel jailbreak method, FicDetail, whose practical performance validates our theoretical findings.
