Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders

Ananya Joshi; Celia Cintas; Skyler Speakman

arXiv:2506.12576·cs.CL·July 1, 2025

Open Access

TL;DR

This paper introduces a novel method using Sparse Autoencoders to enable precise topic alignment in large language models, allowing for flexible, efficient, and interpretable output steering without extensive fine-tuning.

Contribution

The authors propose a new SAE-based approach that scores and modifies neurons for any topic, improving alignment flexibility and efficiency over traditional fine-tuning methods.

Findings

01 Enhanced alignment accuracy on diverse datasets

02 Reduced training time compared to fine-tuning

03 Maintained acceptable inference speed for practical use

Abstract

Recent work shows that Sparse Autoencoders (SAEs) applied to large language model (LLM) layers have neurons corresponding to interpretable concepts. These SAE neurons can be modified to align generated outputs, but only towards pre-identified topics and with some parameter tuning. Our approach leverages the observational and modification properties of SAEs to enable alignment for any topic. This method 1) scores each SAE neuron by its semantic similarity to an alignment text and 2) uses these scores to modify SAE-layer-level outputs by emphasizing topic-aligned neurons. We assess the alignment capabilities of this approach on diverse public topic datasets, including Amazon reviews, Medicine, and Sycophancy, across the currently available open-source LLM and SAE pairs (GPT2 and Gemma) with multiple SAE configurations. Experiments aligning to medical prompts reveal several benefits over…
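
The two steps named in the abstract (score each SAE neuron against an alignment text, then emphasize the topic-aligned neurons) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each SAE neuron already has some embedding vector that can be compared to an embedding of the alignment text, it uses cosine similarity as the semantic-similarity score, and it emphasizes neurons by multiplying the top-scoring activations by a constant. The function names and the `top_k` and `boost` parameters are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_neurons(neuron_embeddings, topic_embedding):
    """Step 1 (assumed form): one similarity score per SAE neuron.

    neuron_embeddings is a hypothetical list of vectors, one per neuron;
    topic_embedding is a hypothetical embedding of the alignment text.
    """
    return [cosine(e, topic_embedding) for e in neuron_embeddings]


def emphasize(activations, scores, top_k=1, boost=2.0):
    """Step 2 (assumed form): scale the activations of the top_k
    highest-scoring neurons by boost and leave the rest unchanged."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:top_k])
    return [a * boost if i in chosen else a for i, a in enumerate(activations)]


# Toy usage: three neurons with 2-d embeddings, topic aligned with the first axis.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_topic = [1.0, 0.0]
toy_scores = score_neurons(toy_embeddings, toy_topic)
toy_steered = emphasize([1.0, 2.0, 3.0], toy_scores, top_k=1, boost=2.0)
```

In this toy case only the first neuron, whose embedding points in the same direction as the topic, is boosted. In a real pipeline the modified activations would then pass through the SAE decoder before continuing through the LLM; that step is omitted here because the abstract does not specify it.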

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques