OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure

Chuancheng Shi, Wenhua Wu, Fei Shen, Xiaogang Zhu, Kun Hu, Zhiyong Wang

arXiv:2603.11493 · cs.CV · March 13, 2026

TL;DR

OrthoEraser introduces a novel orthogonal projection method using sparse autoencoders to precisely erase sensitive concepts in text-to-image models while preserving benign attributes, improving safety and fidelity.

Contribution

It proposes a new orthogonalization-based concept erasure technique that disentangles sensitive and benign features, reducing collateral damage in T2I models.
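The orthogonalization idea can be written as a short derivation. This is a sketch under stated assumptions, not the paper's exact formulation (the abstract is truncated before the method details). Assume $v_s \in \mathbb{R}^d$ is the decoder direction of a sensitive SAE feature, $B \in \mathbb{R}^{d \times k}$ stacks the decoder directions of the coupled benign features as columns (full column rank), and $h$ is a dense activation:

```latex
P_B = B\,(B^\top B)^{-1} B^\top, \qquad
\tilde{v}_s = (I - P_B)\,v_s,
\\[4pt]
h' = h - \frac{\langle h, \tilde{v}_s \rangle}{\lVert \tilde{v}_s \rVert^2}\,\tilde{v}_s
\qquad (\tilde{v}_s \neq 0),
\\[4pt]
B^\top \tilde{v}_s = B^\top v_s - B^\top B\,(B^\top B)^{-1} B^\top v_s = 0
\;\Longrightarrow\; B^\top h' = B^\top h .
```

Removing $\tilde{v}_s$ instead of $v_s$ therefore leaves every benign coordinate $B^\top h$ unchanged, whereas projecting out $v_s$ directly also shifts the benign components whenever $B^\top v_s \neq 0$, which is the non-orthogonal superposition case.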

Findings

1. Achieves high erasure precision and safety.
2. Outperforms state-of-the-art baselines.
3. Preserves the generative quality of benign attributes.

Abstract

Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes when they suppress selected neurons entirely. This occurs because sensitive and benign semantics exhibit non-orthogonal superposition, sharing activation subspaces in which their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAEs) to achieve high-resolution feature disentanglement and then redefines erasure as an analytical orthogonal projection that preserves the invariance of the benign manifold. OrthoEraser first employs an SAE to decompose dense activations and segregate sensitive neurons. It then uses coupled-neuron detection to identify non-sensitive features vulnerable to intervention. The key novelty lies in an analytical…
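The pipeline described in the abstract (SAE decomposition, coupled-neuron detection, orthogonal projection) can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract is truncated before the method details, so the coupling test (a cosine-similarity threshold between decoder directions), the `threshold` and `tol` values, and the function names `coupled_features`, `orthogonalize` and `erase` are all hypothetical. It assumes an already-trained SAE whose decoder rows `W_dec` give one direction per feature, and that the sensitive feature indices are already known.

```python
import numpy as np


def coupled_features(W_dec, sensitive_idx, threshold=0.3):
    """Hypothetical coupling test: indices of non-sensitive SAE features whose
    decoder directions have |cosine similarity| >= threshold with at least
    one sensitive feature's decoder direction."""
    D = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit decoder rows
    sims = np.abs(D @ D[sensitive_idx].T)                      # (n_features, n_sensitive)
    candidates = np.where(sims.max(axis=1) >= threshold)[0]
    return np.setdiff1d(candidates, sensitive_idx)             # drop the sensitive ones


def orthogonalize(v, B):
    """Remove from v its least-squares component in the column span of B."""
    if B.shape[1] == 0:
        return v.copy()
    coeffs, *_ = np.linalg.lstsq(B, v, rcond=None)
    return v - B @ coeffs


def erase(h, W_dec, sensitive_idx, threshold=0.3, tol=1e-8):
    """Project a dense activation h (shape (d,)) off the sensitive subspace,
    after orthogonalizing that subspace against coupled benign directions."""
    benign = coupled_features(W_dec, sensitive_idx, threshold)
    B = W_dec[benign].T                                        # (d, k) benign directions as columns
    U = np.stack([orthogonalize(W_dec[i], B) for i in sensitive_idx], axis=1)  # (d, m)
    Q, s, _ = np.linalg.svd(U, full_matrices=False)
    Q = Q[:, s > tol]                                          # orthonormal basis of span(U)
    return h - Q @ (Q.T @ h)                                   # remove only that subspace


# Toy example: feature 0 is "sensitive", feature 1 is a benign feature that
# shares a direction with it, features 2 and 3 are unrelated.
W_dec = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
h = np.array([2.0, 3.0, 5.0, 7.0])

h_ortho = erase(h, W_dec, [0])
v = W_dec[0] / np.linalg.norm(W_dec[0])
h_naive = h - (h @ v) * v                                      # project out v_s directly
print(h_ortho)  # approximately [0, 3, 5, 7]: benign coordinate untouched
print(h_naive)  # approximately [-0.5, 0.5, 5, 7]: benign coordinate damaged
```

In the toy run, projecting out the raw sensitive direction also drops the benign coordinate from 3 to 0.5, while the orthogonalized version removes only the part of the sensitive direction that the benign feature does not share. Building an SVD basis for all orthogonalized sensitive directions, rather than projecting them out one at a time, keeps the removal correct when those directions are not mutually orthogonal.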

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)