Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders
Ananya Joshi, Celia Cintas, Skyler Speakman

TL;DR
This paper introduces a novel method using Sparse Autoencoders to enable precise topic alignment in large language models, allowing for flexible, efficient, and interpretable output steering without extensive fine-tuning.
Contribution
The authors propose a new SAE-based approach that scores and modifies neurons for any topic, improving alignment flexibility and efficiency over traditional fine-tuning methods.
Findings
Enhanced alignment accuracy on diverse datasets
Reduced training time compared to fine-tuning
Maintained acceptable inference speed for practical use
Abstract
Recent work shows that Sparse Autoencoders (SAEs) applied to large language model (LLM) layers have neurons corresponding to interpretable concepts. These SAE neurons can be modified to align generated outputs, but only towards pre-identified topics and with some parameter tuning. Our approach leverages the observational and modification properties of SAEs to enable alignment for any topic. This method 1) scores each SAE neuron by its semantic similarity to an alignment text and 2) uses these scores to modify SAE-layer-level outputs by emphasizing topic-aligned neurons. We assess the alignment capabilities of this approach on diverse public topic datasets including Amazon reviews, Medicine, and Sycophancy, across the currently available open-source LLM and SAE pairs (GPT2 and Gemma) with multiple SAE configurations. Experiments aligning to medical prompts reveal several benefits over…
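The two steps named in the abstract (score each SAE neuron against an alignment text, then emphasize the topic-aligned neurons) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that each SAE neuron already has an embedding vector (for example, from embedding a text description of the neuron), that semantic similarity is measured by cosine similarity, and that "emphasizing" means multiplying the activations of selected neurons by a constant. The names `cosine_scores`, `steer_sae_output`, `threshold`, and `boost` are hypothetical and chosen only for this example.

```python
import numpy as np


def cosine_scores(neuron_embeddings, topic_embedding):
    """Score each neuron by cosine similarity to the alignment-text embedding.

    neuron_embeddings: array of shape (n_neurons, d)
    topic_embedding:   array of shape (d,)
    """
    neuron_norms = np.linalg.norm(neuron_embeddings, axis=1)
    topic_norm = np.linalg.norm(topic_embedding)
    # Guard against division by zero for all-zero embeddings.
    denom = np.maximum(neuron_norms * topic_norm, 1e-12)
    return (neuron_embeddings @ topic_embedding) / denom


def steer_sae_output(activations, decoder, scores, threshold=0.5, boost=2.0):
    """Scale up topic-aligned neuron activations, then decode.

    activations: SAE hidden activations, shape (n_neurons,)
    decoder:     SAE decoder weights, shape (n_neurons, d_model)
    threshold, boost: hypothetical knobs, not taken from the paper.
    """
    aligned = scores >= threshold
    steered = np.where(aligned, activations * boost, activations)
    return steered @ decoder, aligned


# Toy example: 3 neurons with 2-dimensional embeddings (made-up values).
neuron_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
topic_embedding = np.array([1.0, 0.0])
activations = np.array([1.0, 2.0, 0.5])
decoder = np.eye(3)  # identity decoder so the effect is easy to read

scores = cosine_scores(neuron_embeddings, topic_embedding)
output, aligned = steer_sae_output(activations, decoder, scores)
print(scores, aligned, output)
```

In this toy setup, neurons 0 and 2 score above the threshold and have their activations doubled, while neuron 1 is left unchanged. A real pipeline would replace the made-up embeddings with outputs of a text encoder and the identity decoder with the trained SAE decoder.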
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
