Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification

Xuansheng Wu; Wenhao Yu; Xiaoming Zhai; Ninghao Liu

arXiv:2502.14133 · cs.CL · July 29, 2025

TL;DR

This paper introduces a framework that uses sparse autoencoders to interpret and regularize unintended features in LLM embeddings, enhancing the generalizability and fairness of text classifiers across various tasks.

Contribution

It proposes a novel method to identify and regularize unintended features in LLM latent spaces using sparse autoencoders and a regularizer based on feature similarity.
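As a rough illustration of what a "regularizer based on feature similarity" could look like, the sketch below penalizes the squared cosine similarity between a linear classifier's weight vector and a set of unintended feature directions. The function names, the squared-cosine form, and the weighting parameter `lam` are all assumptions made for this example; the summary above does not specify the paper's exact formulation.

```python
import numpy as np

def similarity_penalty(w, unintended):
    """Mean squared cosine similarity between classifier weights and
    unintended feature directions (hypothetical form, for illustration).

    w          : (d,) classifier weight vector
    unintended : (k, d) matrix whose rows are unintended feature directions
    """
    eps = 1e-12  # guards against division by zero
    w_unit = w / (np.linalg.norm(w) + eps)
    f_unit = unintended / (np.linalg.norm(unintended, axis=1, keepdims=True) + eps)
    cos = f_unit @ w_unit  # (k,) cosine similarity per unintended feature
    return float(np.mean(cos ** 2))

def regularized_loss(task_loss, w, unintended, lam=1.0):
    """Task loss plus the similarity penalty; lam is an assumed trade-off weight."""
    return task_loss + lam * similarity_penalty(w, unintended)

# Toy check: one unintended direction along the first axis.
unintended = np.array([[1.0, 0.0, 0.0]])
w_aligned = np.array([2.0, 0.0, 0.0])  # points along the unintended feature
w_orth = np.array([0.0, 3.0, 0.0])     # orthogonal to it

penalty_aligned = similarity_penalty(w_aligned, unintended)
penalty_orth = similarity_penalty(w_orth, unintended)
```

In this toy setup the aligned weights receive the maximum penalty and the orthogonal weights receive none, which is the behavior one would want from a term that discourages a classifier from relying on the flagged directions.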

Findings

01 Improved classifier generalizability across tasks
02 Effective removal of unintended features
03 Enhanced fairness and privacy in classification

Abstract

Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on…
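To make the SAE step in the abstract more concrete, here is a minimal sketch of a generic ReLU sparse autoencoder applied to a batch of embeddings, with a reconstruction-plus-L1 objective. This is a common textbook form, not necessarily the architecture or loss used in the paper; the dimensions, the `l1_coeff` value, and the random inputs standing in for LLM embeddings are assumptions for illustration.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode embeddings into non-negative feature activations, then decode."""
    z = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps activations non-negative
    x_hat = z @ W_dec + b_dec               # reconstruction of the input embedding
    return z, x_hat

def sae_loss(x, z, x_hat, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return float(recon + l1_coeff * sparsity)

# Random stand-ins for LLM embeddings and SAE parameters (assumed sizes).
rng = np.random.default_rng(0)
d_model, d_sae, batch = 16, 64, 8
x = rng.normal(size=(batch, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, z, x_hat)
```

Each column of `z` plays the role of one candidate interpretable feature, and the corresponding row of `W_dec` gives its direction in the embedding space.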

Taxonomy

Topics: Advanced Data Processing Techniques · Fuzzy Logic and Control Systems · Machine Learning and Data Classification

Methods: Sparse Autoencoder