R-TOFU: Unlearning in Large Reasoning Models
Sangyeon Yoon, Wonje Jeung, Albert No

TL;DR
This paper introduces R-TOFU, a benchmark for evaluating unlearning in large reasoning models (LRMs). It shows that private information is hard to remove from multi-step reasoning traces and proposes Reasoned IDK, a preference-optimization method, to improve unlearning effectiveness.
Contribution
The paper presents R-TOFU, the first benchmark for unlearning in reasoning models, and proposes Reasoned IDK, a preference-optimization method that better balances forgetting and utility.
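As a rough illustration of what a preference-optimization setup for a reasoning model could look like, the sketch below builds one preference pair in which the preferred response keeps a coherent but inconclusive reasoning trace ending in a refusal, and the dispreferred response is the original trace and answer. This is not the paper's data format: the `PreferencePair` class, the `build_pair` helper, the `<think>` tag layout, and the example strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One training example for a generic preference-optimization loss."""
    prompt: str
    chosen: str    # response the model should prefer
    rejected: str  # response the model should move away from


def format_response(reasoning: str, answer: str) -> str:
    # Assumption: the reasoning trace is wrapped in <think> tags,
    # followed by the final answer. Real models may use another layout.
    return f"<think>\n{reasoning}\n</think>\n{answer}"


def build_pair(question: str,
               original_reasoning: str,
               original_answer: str,
               inconclusive_reasoning: str,
               idk_answer: str = "I don't know.") -> PreferencePair:
    """Prefer inconclusive reasoning plus a refusal over the original output."""
    return PreferencePair(
        prompt=question,
        chosen=format_response(inconclusive_reasoning, idk_answer),
        rejected=format_response(original_reasoning, original_answer),
    )


# Hypothetical example data, invented for illustration only.
pair = build_pair(
    question="Where was the author born?",
    original_reasoning="I recall the author was born in Taipei.",
    original_answer="Taipei.",
    inconclusive_reasoning="I consider what I know about the author, "
                           "but I cannot find a reliable birthplace.",
)
print(pair.chosen)
```

The point of the sketch is only that both the reasoning trace and the final answer appear in the preferred and dispreferred responses, so the training signal covers the trace as well as the answer.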
Findings
Answer-only objectives leave residual knowledge in reasoning traces.
Decoding variants can still reveal forgotten content despite unlearning.
R-TOFU provides a systematic foundation for studying unlearning in LRMs.
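The first finding can be pictured with a small check that looks for a forgotten fact in each reasoning step as well as in the final answer. This is a minimal sketch, not R-TOFU's actual metric: the sentence-based step splitting, the substring match, and the names `split_steps` and `stepwise_leak_report` are assumptions made for illustration.

```python
import re


def split_steps(cot: str) -> list[str]:
    """Split a reasoning trace into steps at sentence ends or newlines.

    Assumption: one sentence is one step. Real traces may need a
    different segmentation.
    """
    parts = re.split(r"(?<=[.!?])\s+|\n+", cot.strip())
    return [p for p in parts if p]


def leaks(text: str, secret: str) -> bool:
    # Naive case-insensitive substring match, used here only as a stand-in
    # for a real leakage measure.
    return secret.lower() in text.lower()


def stepwise_leak_report(cot: str, answer: str, secret: str) -> dict:
    """Report leakage in the final answer and in each reasoning step."""
    steps = split_steps(cot)
    leaking = [i for i, step in enumerate(steps) if leaks(step, secret)]
    return {
        "answer_leak": leaks(answer, secret),
        "leaking_steps": leaking,
        "step_leak_rate": len(leaking) / len(steps) if steps else 0.0,
    }


# Hypothetical model output, invented for illustration only.
report = stepwise_leak_report(
    cot="The question asks about the author's birthplace. "
        "I recall the author was born in Taipei. "
        "I should not reveal this.",
    answer="I don't know.",
    secret="Taipei",
)
print(report)
```

In this made-up example an answer-level check passes, since the final answer is a refusal, while the step-level check flags the second reasoning step.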
Abstract
Large Reasoning Models (LRMs) embed private or copyrighted information not only in their final answers but also throughout multi-step chain-of-thought (CoT) traces, making reliable unlearning far more demanding than in standard LLMs. We introduce Reasoning-TOFU (R-TOFU), the first benchmark tailored to this setting. R-TOFU augments existing unlearning tasks with realistic CoT annotations and provides step-wise metrics that expose residual knowledge invisible to answer-level checks. Using R-TOFU, we carry out a comprehensive comparison of gradient-based and preference-optimization baselines and show that conventional answer-only objectives leave substantial forget traces in reasoning. We further propose Reasoned IDK, a preference-optimization variant that preserves coherent yet inconclusive reasoning, achieving a stronger balance between forgetting efficacy and model utility than earlier…
Taxonomy
Topics: Rough Sets and Fuzzy Logic · Machine Learning and Data Classification · AI-based Problem Solving and Planning
