CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization
Junyi Li, Yongqiang Chen, Ningning Ding

TL;DR
CiPO introduces an iterative preference optimization framework to effectively unlearn specific knowledge from large reasoning models without impairing their reasoning capabilities.
Contribution
The paper proposes a novel counterfactual unlearning method that targets reasoning traces, enabling complete knowledge removal while maintaining reasoning performance.
Findings
CiPO successfully removes undesired knowledge from reasoning traces.
The method preserves the reasoning abilities of large models after unlearning.
Experiments show superior unlearning effectiveness on challenging benchmarks.
Abstract
Machine unlearning has gained increasing attention in recent years as a promising technique to selectively remove unwanted private or copyrighted information from Large Language Models that are trained on massive amounts of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, poses a dilemma for unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade reasoning performance by interfering with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as a targeted intervention on the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a…
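The abstract is cut off before it states the training objective, so the following is only an illustrative sketch of what a preference-optimization step could look like. It assumes a DPO-style pairwise loss, and it assumes that a trace consistent with the unlearning target answer is treated as the preferred response while the model's original trace is the dispreferred one. Neither assumption is confirmed by the text above. The function name `preference_loss` and all of its parameters are hypothetical.

```python
import math

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for a single preference pair.

    ASSUMPTION: this objective is a stand-in; the source does not say
    which preference-optimization loss CiPO uses.
    Inputs are sequence log-probabilities under the current policy
    (logp_*) and under a frozen reference model (ref_logp_*).
    """
    # How much more the policy favors the chosen trace over the rejected
    # one, relative to the reference model, scaled by beta.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities for one (chosen, rejected) trace pair.
loss_before = preference_loss(-12.0, -10.0, -12.0, -10.0)  # policy equals reference
loss_after = preference_loss(-9.0, -14.0, -12.0, -10.0)    # policy now favors chosen
```

In this sketch the loss equals log 2 when the policy matches the reference and falls as the policy shifts probability toward the chosen trace. An iterative scheme would presumably regenerate the trace pairs from the updated model between rounds, but that loop is not described in the visible part of the abstract and is left out here.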
