From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning

Haoran Tang; Rajiv Khanna

arXiv:2601.22028 · cs.LG · January 30, 2026

TL;DR

This paper introduces CLReg, a contrastive regularizer that reduces entanglement between forgotten and retained concepts in LLMs, improving unlearning effectiveness without significant distribution shifts.

Contribution

The paper proposes a novel contrastive regularizer, CLReg, that explicitly reduces forget-retain entanglement in representations, backed by theoretical insights and empirical validation.

Findings

01. CLReg decreases forget-retain entanglement in representations.

02. It improves unlearning performance across benchmarks and model sizes.

03. CLReg does not introduce additional privacy risks.

Abstract

Most LLM unlearning methods aim to approximate retrain-from-scratch behaviors with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing forgotten content generation, such approaches may act as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features while pushing them away from retain features, explicitly reducing forget-retain interference with minimal shifts on retain features. We provide the first theoretical insights that relate representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement, which facilitates mainstream unlearning methods, without posing extra privacy…
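The idea of pushing forget features away from retain features can be illustrated with a generic InfoNCE-style contrastive loss. The sketch below is not the paper's CLReg objective, which is not specified on this page; it only assumes that a forget feature is used as the anchor, a second forget feature as the positive, and retain features as negatives. The function names, the toy 2-D vectors, and the temperature of 0.1 are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_forget_loss(anchor, positive, retain_feats, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed stand-in, not the paper's CLReg).

    anchor, positive: two forget-set feature vectors (pulled together).
    retain_feats: list of retain-set feature vectors (pushed away).
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, r) / temperature) for r in retain_feats)
    return -math.log(pos / (pos + neg))


# Toy 2-D features (hypothetical values, for illustration only).
forget_a = [1.0, 0.0]
forget_b = [0.9, 0.1]
entangled_retain = [[0.8, 0.2]]   # retain feature close to the forget features
separated_retain = [[0.0, 1.0]]   # retain feature orthogonal to the anchor

loss_entangled = contrastive_forget_loss(forget_a, forget_b, entangled_retain)
loss_separated = contrastive_forget_loss(forget_a, forget_b, separated_retain)

# The loss is lower when retain features lie far from the forget features.
print(loss_entangled > loss_separated)
```

Under these assumptions, minimizing such a term would reward representations in which forget and retain features are separated, which is the qualitative behavior the abstract describes. How the actual method keeps shifts on retain features minimal is not captured by this sketch.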

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Domain Adaptation and Few-Shot Learning