LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models
Jingyuan Yang, Rongjun Li, Weixuan Wang, Ziyu Zhou, Zhiyong Feng, Wei Peng

TL;DR
LF-Steering introduces a feature-level activation steering method using sparse autoencoders to improve semantic consistency in large language models, addressing the polysemanticity issue and enabling precise control over model outputs.
Contribution
The paper proposes LF-Steering, a novel approach that maps transformer hidden states into a sparse feature space for better semantic control, advancing beyond component-level steering methods.
Findings
Significant improvement in semantic consistency across NLU and NLG tasks.
Effective decoupling of features reduces interference during steering.
Enhanced performance demonstrated on multiple datasets.
Abstract
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. Recently, activation steering, a technique that modulates LLMs' behaviours by adjusting their latent representations during inference time, has been explored to improve the semantic consistency of LLMs. However, these methods typically operate at the model component level, such as layer hidden states or attention head outputs. They face a challenge due to the "polysemanticity issue", where the model components of LLMs typically encode multiple entangled features, making precise steering difficult. To address this challenge, we drill down to feature-level representations and propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency. More specifically, our method maps…
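The general idea of steering in a sparse feature space, rather than directly on a hidden state, can be illustrated with a minimal sketch. This is not the paper's implementation: the autoencoder weights below are random stand-ins for a trained sparse autoencoder, and the dimensions, the feature indices, the scale `alpha`, and the function names `encode`, `decode`, and `steer` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # hypothetical hidden size and SAE dictionary size

# Random weights stand in for a trained sparse autoencoder (assumption).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)


def encode(h):
    """Map hidden states to non-negative sparse feature activations (ReLU encoder)."""
    return np.maximum(0.0, (h - b_dec) @ W_enc + b_enc)


def decode(f):
    """Map feature activations back to the hidden-state space."""
    return f @ W_dec + b_dec


def steer(h, feature_ids, alpha):
    """Add `alpha` to the chosen feature activations and return the edited hidden state.

    The SAE reconstruction error is added back, so information the
    autoencoder fails to capture is left untouched.
    """
    f = encode(h)
    error = h - decode(f)
    f_steered = f.copy()
    f_steered[..., feature_ids] += alpha
    return decode(f_steered) + error


# Three hypothetical token hidden states; features 5 and 12 are arbitrary picks.
h = rng.normal(size=(3, d_model))
h_steered = steer(h, feature_ids=[5, 12], alpha=2.0)
```

Because the decoder is linear, this edit shifts each hidden state by `alpha` times the sum of the selected decoder rows, which is why acting on individual features can be more targeted than adding a single vector derived from a whole layer or attention head. Which features to select is the part this sketch leaves out.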
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · Sparse Autoencoder
