TL;DR
This paper introduces FLAS, a flow-based activation steering method that learns to modify language model activations at inference time, outperforming prompting on unseen concepts without per-concept tuning.
Contribution
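FLAS is a learned, concept-conditioned flow method for activation steering. It is the first learned method to consistently outperform prompting on AxBench, and it challenges prior assumptions about activation-space geometry, namely that steering can be restricted to fixed, single-step, position-invariant transforms.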
Findings
FLAS outperforms prompting on AxBench.
FLAS achieves harmonic-mean scores of 1.015 and 1.113 on Gemma-2-2B-IT and Gemma-2-9B-IT, respectively.
Learned flows reveal curved, multi-step, token-varying activation trajectories.
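To make the harmonic-mean scores above easier to read, here is a small Python sketch of how a harmonic mean combines several sub-scores. It assumes, as an illustration only, three per-response sub-scores (concept, instruction following, fluency) each on a 0 to 2 scale; this summary does not spell out the sub-scores or their scale, and the example values are made up rather than taken from the paper.

```python
def harmonic_mean(scores):
    """Harmonic mean of non-negative scores; returns 0.0 if any score is 0."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        # A single zero sub-score drives the whole harmonic mean to zero.
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical sub-scores (concept, instruction, fluency), each assumed in [0, 2].
balanced = harmonic_mean([2.0, 1.0, 1.0])     # 3 / (0.5 + 1 + 1) = 1.2
one_failure = harmonic_mean([2.0, 2.0, 0.0])  # 0.0: one failed criterion zeroes the score
arithmetic = sum([2.0, 1.0, 1.0]) / 3         # about 1.33, for comparison
```

Because the harmonic mean is pulled toward the weakest sub-score, a method cannot score well by maximizing one criterion while failing another.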
Abstract
Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting,…
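As a minimal sketch of the idea in the abstract, the following Python toy integrates a velocity field with fixed-step Euler updates and contrasts it with a single-step additive shift. Everything here is an assumption for illustration: `toy_velocity`, its gain schedule, `euler_steer`, and `additive_steer` are hypothetical names, and the hand-written field stands in for the learned, concept-conditioned network, whose architecture and solver this summary does not describe.

```python
import math


def toy_velocity(h, concept, t, position):
    """Hypothetical stand-in for a learned field v(h, t | concept, position)."""
    # Pull the activation toward the concept vector; the strength depends on
    # flow time t and on the token position, so the transform is not
    # position-invariant.
    gain = (1.0 + 0.1 * position) * (1.0 - 0.5 * t)
    return [gain * (c - x) for x, c in zip(h, concept)]


def euler_steer(h0, concept, position, num_steps=8):
    """Integrate dh/dt = v(h, t) from t=0 to t=1 with fixed-step Euler."""
    h = list(h0)
    dt = 1.0 / num_steps
    trajectory = [h]
    for k in range(num_steps):
        t = k * dt
        v = toy_velocity(h, concept, t, position)
        h = [x + dt * vx for x, vx in zip(h, v)]
        trajectory.append(h)
    return trajectory


def additive_steer(h0, direction, alpha=1.0):
    """Baseline for contrast: one fixed shift, identical at every position."""
    return [x + alpha * d for x, d in zip(h0, direction)]


h0 = [0.0, 0.0]          # toy unsteered activation
concept = [1.0, 2.0]     # toy concept embedding in the same space
flow_pos0 = euler_steer(h0, concept, position=0)
flow_pos3 = euler_steer(h0, concept, position=3)
shifted = additive_steer(h0, concept, alpha=0.5)
```

In this toy, the multi-step update yields different endpoints at different token positions, whereas the additive baseline applies the same shift everywhere. A learned field could additionally bend the path, which this hand-written field does not attempt to show.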
