A Language Model's Guide Through Latent Space
Dimitri von Rütte, Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann

TL;DR
This paper explores the extension of concept guidance in language models to complex concepts like humor and appropriateness, developing new evaluation metrics and revealing challenges in guiding models with these concepts.
Contribution
It introduces a novel metric for evaluating concept guidance and provides extensive experiments on guiding language models with diverse, complex concepts.
Findings
Some concepts like truthfulness are easier to guide than others.
Optimal detection accuracy does not always lead to better guidance.
Guidance effectiveness varies significantly across different concepts.
Abstract
Concept guidance has emerged as a cheap and simple way to control the behavior of language models by probing their hidden representations for concept vectors and using them to perturb activations at inference time. While the focus of previous work has largely been on truthfulness, in this paper we extend this framework to a richer set of concepts such as appropriateness, humor, creativity and quality, and explore to what degree current detection and guidance strategies work in these challenging settings. To facilitate evaluation, we develop a novel metric for concept guidance that takes into account both the success of concept elicitation as well as the potential degradation in fluency of the guided model. Our extensive experiments reveal that while some concepts such as truthfulness more easily allow for guidance with current techniques, novel concepts such as appropriateness or humor…
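The probe-then-perturb idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the difference-of-means probe, the synthetic "hidden states", the helper names (`concept_vector`, `guide`, `dot`, `mean_vector`), and the guidance strength `alpha` are all assumptions chosen for illustration. Real concept guidance would read activations from a language model and add the vector to them during the forward pass.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(hidden, labels):
    """Assumed probe: unit-norm difference between the class means."""
    pos = mean_vector([h for h, y in zip(hidden, labels) if y == 1])
    neg = mean_vector([h for h, y in zip(hidden, labels) if y == 0])
    diff = [p - q for p, q in zip(pos, neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def guide(activation, v, alpha):
    """Perturb an activation by alpha along the concept direction v."""
    return [a + alpha * c for a, c in zip(activation, v)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Synthetic stand-in for hidden states: the concept lives in dimension 0.
random.seed(0)
dim = 16
labels = [0] * 50 + [1] * 50
hidden = []
for y in labels:
    h = [random.gauss(0.0, 1.0) for _ in range(dim)]
    h[0] += 2.0 * y  # positive examples are shifted along dimension 0
    hidden.append(h)

v = concept_vector(hidden, labels)

# Guide a fresh activation and measure how far it moved along v.
h = [random.gauss(0.0, 1.0) for _ in range(dim)]
h_guided = guide(h, v, alpha=4.0)
shift = dot(h_guided, v) - dot(h, v)
```

Because `v` has unit norm, the projection of the guided activation onto `v` grows by `alpha`. In a real model, a larger `alpha` elicits the concept more strongly but may also degrade fluency, which is the trade-off the paper's metric is described as capturing.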
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · Focus
