Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification
Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu

TL;DR
This paper introduces a framework that uses sparse autoencoders to interpret and regularize unintended features in LLM embeddings, enhancing the generalizability and fairness of text classifiers across various tasks.
Contribution
It proposes a novel method to identify and regularize unintended features in LLM latent spaces using sparse autoencoders and a regularizer based on feature similarity.
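As a rough illustration of the two ingredients named above, the sketch below shows a ReLU sparse-autoencoder encoder and one possible similarity-based penalty. This summary does not give the paper's exact formulation, so everything here is an assumption made for illustration: the function names `sae_encode` and `similarity_penalty`, the weight shapes, and the choice of squared cosine similarity between a linear classifier's weight vector and the SAE decoder directions of the features flagged as unintended.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Map an embedding x (d,) to SAE feature activations (m,) with a ReLU encoder.

    Assumed shapes: W_enc is (d, m), b_enc is (m,).
    """
    return np.maximum(0.0, x @ W_enc + b_enc)

def similarity_penalty(w, W_dec, unintended_idx):
    """Hypothetical regularizer: sum of squared cosine similarities between
    the classifier weight vector w (d,) and the decoder directions of the
    unintended SAE features.

    Assumed shape: W_dec is (m, d), one row per SAE feature.
    """
    dirs = W_dec[unintended_idx]                      # (k, d) selected directions
    w_unit = w / (np.linalg.norm(w) + 1e-12)          # normalize classifier weights
    dirs_unit = dirs / (np.linalg.norm(dirs, axis=1, keepdims=True) + 1e-12)
    cos = dirs_unit @ w_unit                          # (k,) cosine similarities
    return float(np.sum(cos ** 2))

# Toy setup: 4-dimensional embeddings, 3 SAE features aligned with the axes.
W_dec = np.eye(3, 4)          # feature i decodes to basis vector e_i
W_enc = W_dec.T               # tied encoder, only for this toy example
b_enc = np.zeros(3)

features = sae_encode(np.array([1.0, -1.0, 0.0, 0.0]), W_enc, b_enc)

# Treat feature 0 as "unintended": a classifier aligned with it is penalized,
# while a classifier orthogonal to it is not.
penalty_aligned = similarity_penalty(np.array([2.0, 0.0, 0.0, 0.0]), W_dec, [0])
penalty_orthogonal = similarity_penalty(np.array([0.0, 3.0, 0.0, 0.0]), W_dec, [0])
```

In training, a term like this would be added to the classification loss with a weighting coefficient, so the classifier is discouraged from relying on the flagged directions. How the features are flagged and weighted in the paper itself is not specified in this summary.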
Findings
Improved classifier generalizability across tasks
Effective removal of unintended features
Enhanced fairness and privacy in classification
Abstract
Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on…
Taxonomy
Topics: Advanced Data Processing Techniques · Fuzzy Logic and Control Systems · Machine Learning and Data Classification
Methods: Sparse Autoencoder
