TL;DR
This paper introduces the Frame Representation Hypothesis, extending the Linear Representation Hypothesis to multi-token words, enabling interpretability and concept-guided control of large language models for safer and more transparent AI.
Contribution
It proposes a novel multi-token word interpretation framework based on frames, allowing concept-based control and analysis of LLMs, extending prior single-token approaches.
Findings
Demonstrates gender and language bias detection in Llama 3.1, Gemma 2, and Phi 3.
Shows potential for bias remediation and content safety improvements.
Provides open-source code for implementation.
Abstract
Interpretability is a key challenge in fostering trust for Large Language Models (LLMs), which stems from the complexity of extracting reasoning from a model's parameters. We present the Frame Representation Hypothesis, a theoretically robust framework grounded in the Linear Representation Hypothesis (LRH) to interpret and control LLMs by modeling multi-token words. Prior research explored LRH to connect LLM representations with linguistic concepts, but was limited to single-token analysis. As most words are composed of several tokens, we extend LRH to multi-token words, thereby enabling usage on any textual data with thousands of concepts. To this end, we propose that words can be interpreted as frames, ordered sequences of vectors that better capture token-word relationships. Then, concepts can be represented as the average of word frames sharing a common concept. We showcase these tools…
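The two definitions in the abstract (a word as an ordered sequence of token vectors, and a concept as the average of word frames) can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration: the token vectors are random rather than taken from a real model, the tokenizations are invented, the helper names `word_frame` and `concept_frame` are hypothetical, and the position-by-position mean over frames of equal length is a simplification that may differ from the averaging the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical embedding dimension

# Hypothetical token-embedding table; a real model would supply these vectors.
vocab = ["un", "believ", "able", "in", "cred", "ible"]
token_vectors = {tok: rng.normal(size=DIM) for tok in vocab}


def word_frame(tokens):
    """Stack a word's token vectors, in order, into a (num_tokens, DIM) array."""
    return np.stack([token_vectors[t] for t in tokens])


def concept_frame(frames):
    """Average word frames position by position (assumes equal token counts)."""
    lengths = {f.shape[0] for f in frames}
    if len(lengths) != 1:
        raise ValueError("this sketch only averages frames with the same number of tokens")
    return np.mean(np.stack(frames), axis=0)


# Two invented three-token words assumed to share a concept.
frame_a = word_frame(["un", "believ", "able"])
frame_b = word_frame(["in", "cred", "ible"])
concept = concept_frame([frame_a, frame_b])
```

Keeping the token order inside each frame is what distinguishes this from a bag-of-tokens average: the first row of `concept` only mixes first-token vectors, the second row only second-token vectors, and so on.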
Taxonomy
Methods: LLaMA
