Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks
Madeline Brumley, Joe Kwon, David Krueger, Dmitrii Krasheninnikov, Usman Anwar

TL;DR
This paper compares bottom-up and top-down interpretability methods for large language models, revealing their strengths and limitations across different in-context learning tasks.
Contribution
It provides a quantitative comparison of vector steering methods, highlighting their task-specific effectiveness and guiding future interpretability research.
Findings
ICV outperforms FV in behavioral shifting tasks
FV excels in tasks requiring precision
Both methods are effective only on specific task types
Abstract
A key objective of interpretability research on large language models (LLMs) is to develop methods for robustly steering models toward desired behaviors. To this end, two distinct approaches to interpretability, "bottom-up" and "top-down", have been presented, but there has been little quantitative comparison between them. We present a case study comparing the effectiveness of representative vector steering methods from each branch: function vectors (FV; arXiv:2310.15213) as a bottom-up method, and in-context vectors (ICV; arXiv:2311.06668) as a top-down method. While both aim to capture compact representations of broad in-context learning tasks, we find they are effective only on specific types of tasks: ICVs outperform FVs in behavioral shifting, whereas FVs excel in tasks requiring more precision. We discuss the implications for future evaluations of steering methods and for…
Taxonomy
Topics: Human-Automation Interaction and Safety
