Steered Generation via Gradient Descent on Sparse Features

Sumanta Bhattacharyya; Pedram Rooshenas

arXiv:2502.18644·cs.CL·February 27, 2025

Open Access

TL;DR

This paper steers the output of large language models by training sparse autoencoders on internal query embeddings and then modifying the learned sparse representation, giving control over stylistic and cognitive attributes of the generated text.

Contribution

The paper proposes training sparse autoencoders on LLM query embeddings so that output characteristics can be manipulated in a targeted way through gradient-based optimization of the sparse representation.

Findings

01. Effective adjustment of LLM output style and cognitive complexity.

02. Controlled transformation of generated feedback in educational settings.

03. Demonstrated precise steering via sparse latent space manipulation.

Abstract

Large language models (LLMs) encode a diverse range of linguistic features within their latent representations, which can be harnessed to steer their output toward specific target characteristics. In this paper, we modify the internal structure of LLMs by training sparse autoencoders to learn a sparse representation of the query embedding, allowing precise control over the model's attention distribution. We demonstrate that manipulating this sparse representation effectively transforms the output toward different stylistic and cognitive targets. Specifically, in an educational setting, we show that the cognitive complexity of LLM-generated feedback can be systematically adjusted by modifying the encoded query representation at a specific layer. To achieve this, we guide the learned sparse embedding toward the representation of samples from the desired cognitive complexity level, using…
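
The steering step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption, since the abstract is truncated before it names the objective: the squared-distance loss, the mean sparse code as the target, the projection onto non-negative codes, and all function names are hypothetical. The encoder and decoder weights are random stand-ins for a trained sparse autoencoder, and the vectors are toy data rather than real query embeddings.

```python
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def encode(W_e, b_e, q):
    """Sparse code z = ReLU(W_e q + b_e)."""
    return [max(0.0, a + b) for a, b in zip(matvec(W_e, q), b_e)]

def decode(W_d, b_d, z):
    """Reconstructed query embedding q' = W_d z + b_d."""
    return [a + b for a, b in zip(matvec(W_d, z), b_d)]

def mean_code(codes):
    """Element-wise mean of a list of sparse codes."""
    n = len(codes)
    return [sum(col) / n for col in zip(*codes)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def steer(z, z_target, lr=0.1, steps=50):
    """Gradient descent on z for the ASSUMED loss 0.5 * ||z - z_target||^2,
    clipping to z >= 0 so the code stays in the ReLU range."""
    z = list(z)
    for _ in range(steps):
        grad = [zi - ti for zi, ti in zip(z, z_target)]
        z = [max(0.0, zi - lr * gi) for zi, gi in zip(z, grad)]
    return z

random.seed(0)
d, k = 4, 8  # toy sizes: embedding dimension and sparse code dimension

# Random stand-ins for trained autoencoder parameters (assumption).
W_e = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
b_e = [0.0] * k
W_d = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
b_d = [0.0] * d

# Toy "query embeddings" for samples at the desired target level.
target_queries = [[random.gauss(1, 0.1) for _ in range(d)] for _ in range(5)]
z_target = mean_code([encode(W_e, b_e, q) for q in target_queries])

# Steer one source query toward the target in sparse space, then decode.
q = [random.gauss(-1, 0.1) for _ in range(d)]
z0 = encode(W_e, b_e, q)
z1 = steer(z0, z_target)
q_steered = decode(W_d, b_d, z1)
```

In an actual model, `q_steered` would stand in for the original query vector at the chosen layer, which is how a change in the sparse code could reach the attention distribution. The paper's own objective and update rule may differ from this sketch.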

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis