Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language
Anthony Costarelli, Mat Allen, Severin Field

TL;DR
This paper introduces meta-models, an architecture that interprets LLM behaviors through activations and natural language, demonstrating good generalization to out-of-distribution deceptive scenarios and aiding in faithful model interpretation.
Contribution
The paper proposes a novel meta-model architecture that uses activations to answer natural language questions about LLM behaviors, enhancing interpretability and generalization.
Findings
Meta-models generalize well to out-of-distribution tasks
Meta-models effectively interpret deceptive behaviors in LLMs
The approach opens new avenues for faithful model explanation
Abstract
As Large Language Models (LLMs) become increasingly integrated into our daily lives, the potential harms from deceptive behavior underscore the need for faithfully interpreting their decision-making. While traditional probing methods have shown some effectiveness, they work best on narrowly scoped tasks, and more comprehensive explanations are still needed. To this end, we investigate meta-models: an architecture in which a "meta-model" takes activations from an "input-model" and answers natural language questions about the input-model's behaviors. We evaluate the ability of meta-models to generalize by training them on selected task types and assessing their out-of-distribution performance in deceptive scenarios. Our findings show that meta-models generalize well to out-of-distribution tasks and point towards opportunities for future research in this area. Our code is available at…
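The evaluation protocol described in the abstract (train on selected task types, then test on held-out deceptive scenarios) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and does not come from the authors' code: the `MetaExample` schema, the task-type labels, the toy activation values, and the exact-match metric are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass(frozen=True)
class MetaExample:
    """One item for a meta-model (hypothetical schema, not the paper's)."""
    activations: Tuple[float, ...]  # activations captured from the input-model
    question: str                   # natural language question about its behavior
    answer: str                     # target answer the meta-model should produce
    task_type: str                  # label used only to build the split


def split_by_task_type(
    examples: Sequence[MetaExample], held_out: Set[str]
) -> Tuple[List[MetaExample], List[MetaExample]]:
    """Return (train, ood_eval); held-out task types never appear in train."""
    train = [ex for ex in examples if ex.task_type not in held_out]
    ood_eval = [ex for ex in examples if ex.task_type in held_out]
    return train, ood_eval


def exact_match_accuracy(
    predictions: Sequence[str], examples: Sequence[MetaExample]
) -> float:
    """Fraction of predictions equal to the target, ignoring case and padding."""
    if not examples or len(predictions) != len(examples):
        raise ValueError("need one prediction per example, and at least one example")
    hits = sum(
        pred.strip().lower() == ex.answer.strip().lower()
        for pred, ex in zip(predictions, examples)
    )
    return hits / len(examples)


# Toy data: the task names and activation vectors are made up for illustration.
examples = [
    MetaExample((0.1, -0.3), "Is the sentiment positive?", "yes", "sentiment"),
    MetaExample((0.7, 0.2), "Is the text in English?", "no", "language_id"),
    MetaExample((-0.5, 0.9), "Is the model being deceptive?", "yes", "deception"),
]

# Hold out the deceptive task type so it is only seen at evaluation time.
train, ood_eval = split_by_task_type(examples, held_out={"deception"})
```

Splitting by task type rather than by random sample is what makes the held-out set out-of-distribution: the meta-model is scored on a kind of question it was never trained to answer.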
Taxonomy
Topics: Digital Rights Management and Security
