Probe-Free Low-Rank Activation Intervention
Chonghe Jiang, Bao Nguyen, Anthony Man-Cho So, Viet Anh Nguyen

TL;DR
This paper introduces FLORAIN, a low-rank activation intervention method for language models that improves the truthfulness and quality of generated text without training activation probes or classifiers.
Contribution
FLORAIN is a novel probe-free activation intervention technique that uses a low-rank mapping trained to steer language model activations towards desirable content.
Findings
FLORAIN outperforms baseline methods on truthfulness and generation quality.
The method is efficient due to a smooth optimization process.
It is effective across multiple models and tasks.
Abstract
Language models (LMs) can produce texts that appear accurate and coherent but contain untruthful or toxic content. Inference-time interventions that edit the hidden activations have shown promising results in steering the LMs towards desirable generations. Existing activation intervention methods often include an activation probe that detects undesirable generation and triggers the activation modification to steer subsequent generation. This paper proposes FLORAIN, a probe-free intervention method for all attention heads in a specific activation layer. It eliminates the need to train classifiers for probing purposes. The intervention function is parametrized by a sample-wise nonlinear low-rank mapping, which is trained by minimizing the distance between the modified activations and their projection onto the manifold of desirable content. Under specific constructions of the manifold and…
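The training idea in the abstract (fit a low-rank map so that modified activations lie close to their projection onto a region of desirable content) can be illustrated with a small NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the desirable region is modelled as a Euclidean ball around the mean of synthetic "desirable" activations, the map is a hypothetical residual form a + U tanh(Vᵀa), and the names `project_to_ball`, `intervene`, and `loss_and_grads` are invented for this example. The paper's actual manifold construction and parametrization are elided in the truncated abstract above and may differ.

```python
import numpy as np

def project_to_ball(x, center, radius):
    """Project each row of x onto the Euclidean ball B(center, radius)."""
    diff = x - center
    norm = np.linalg.norm(diff, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norm, 1e-12))
    return center + diff * scale

def intervene(a, U, V):
    """Hypothetical low-rank nonlinear map: a + tanh(a V) U^T (rank = V.shape[1])."""
    h = np.tanh(a @ V)
    return a + h @ U.T, h

def loss_and_grads(a, U, V, center, radius):
    """Mean of 0.5 * squared distance to the ball, with gradients for U and V."""
    x, h = intervene(a, U, V)
    # For a convex set, the gradient of 0.5 * dist(x)^2 w.r.t. x is x - P(x).
    g = x - project_to_ball(x, center, radius)
    n = a.shape[0]
    loss = 0.5 * np.mean(np.sum(g ** 2, axis=1))
    dU = g.T @ h / n                   # shape (d, r)
    dz = (g @ U) * (1.0 - h ** 2)      # backprop through tanh
    dV = a.T @ dz / n                  # shape (d, r)
    return loss, dU, dV

# Synthetic stand-ins for activations (assumption: no real model is involved).
rng = np.random.default_rng(0)
d, r, n = 8, 2, 200
desirable = rng.normal(loc=0.5, scale=0.3, size=(n, d))
undesirable = rng.normal(loc=-0.5, scale=0.3, size=(n, d))

# Assumed "desirable region": a ball covering about 90% of desirable samples.
center = desirable.mean(axis=0)
radius = np.quantile(np.linalg.norm(desirable - center, axis=1), 0.9)

U = 0.1 * rng.normal(size=(d, r))
V = 0.1 * rng.normal(size=(d, r))
initial_loss, _, _ = loss_and_grads(undesirable, U, V, center, radius)

# Plain gradient descent with a small step size.
for _ in range(1000):
    _, dU, dV = loss_and_grads(undesirable, U, V, center, radius)
    U -= 0.02 * dU
    V -= 0.02 * dV

final_loss, _, _ = loss_and_grads(undesirable, U, V, center, radius)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

No probe or classifier appears anywhere in the sketch: the only supervision is the geometry of the desirable samples, which is the sense in which such a scheme is probe-free. A ball is used only because its projection has a simple closed form.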
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Neuroscience and Neural Engineering · EEG and Brain-Computer Interfaces
