TL;DR
BARRIER introduces a novel geometric framework for robust machine unlearning by bounding activation regions, enabling targeted concept erasure with formal guarantees and minimal collateral damage.
Contribution
BARRIER shifts the intervention focus from model weights to activation space geometry, employing Interval Arithmetic for rigorous and aggressive unlearning updates.
Findings
Matches state-of-the-art trade-offs in concept erasure
Provides formal bounds on model response during unlearning
Ensures preservation of neutral concepts with aggressive updates
Abstract
Machine unlearning has reached a critical bottleneck. As traditional weight-space interventions focus primarily on erasing targeted concepts, they often fail to prevent the unintended suppression of other significant representations. This leads to substantial collateral damage, with essential knowledge being forgotten, because these methods lack formal mathematical guarantees for the preservation of neutral concepts. To avoid degradation, they are frequently forced into conservative updates. We propose BARRIER (Bounded Activation Regions for Robust Information Erasure), a framework that shifts the locus of intervention from static model weights to the dynamic geometry of hidden-layer activations. Unlike existing methods, BARRIER employs Interval Arithmetic (IA) on SVD-based projections of the activation space to encapsulate the specific target region within a bounding…
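The abstract is cut off before it describes the construction, so the following is only a minimal sketch of the general idea it names, not the authors' implementation: project activations onto a few directions, enclose the target concept's projections in an axis-aligned box, and use interval arithmetic to bound what a linear map can output anywhere inside that box. The function names and the toy data are hypothetical. The `basis` rows are assumed to be orthonormal directions of the kind an SVD of the activation matrix would supply; here they are fixed by hand to keep the example dependency-free.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Box = Tuple[Vector, Vector]  # (lower bounds, upper bounds), one entry per coordinate


def project(basis: Matrix, x: Vector) -> Vector:
    """Coordinates of x along each row of `basis` (rows assumed orthonormal)."""
    return [sum(b * xi for b, xi in zip(row, x)) for row in basis]


def bounding_box(points: List[Vector]) -> Box:
    """Tightest axis-aligned box that contains every point."""
    lower = [min(col) for col in zip(*points)]
    upper = [max(col) for col in zip(*points)]
    return lower, upper


def contains(box: Box, z: Vector) -> bool:
    """True if z lies inside the box on every coordinate."""
    lower, upper = box
    return all(lo <= zi <= hi for lo, zi, hi in zip(lower, z, upper))


def affine_bounds(weights: Matrix, bias: Vector, box: Box) -> Box:
    """Interval-arithmetic bounds on W z + b that hold for every z in the box."""
    lower, upper = box
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight keeps the interval orientation;
            # a negative weight swaps which endpoint gives the minimum.
            lo += w * l if w >= 0 else w * u
            hi += w * u if w >= 0 else w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


# Hypothetical stand-in for the top-2 SVD directions of a 3-d activation space.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Toy activations recorded for the concept to be erased.
target_acts = [[1.0, 2.0, 0.5], [1.5, 1.0, -0.5], [0.5, 1.5, 0.0]]
box = bounding_box([project(basis, a) for a in target_acts])

# A toy neutral activation falls outside the target box.
neutral = [-1.0, 0.0, 0.3]
print(contains(box, project(basis, neutral)))

# Bounds on a hypothetical linear read-out over the whole target box.
print(affine_bounds([[2.0, -1.0]], [0.5], box))
```

Because the output bounds hold for every point in the box rather than only for the sampled activations, a guarantee of this kind does not depend on how the target region was sampled. Whether BARRIER uses boxes of exactly this form cannot be confirmed from the truncated abstract.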
