Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language

Anthony Costarelli; Mat Allen; Severin Field

arXiv:2410.02472 · cs.LG · November 8, 2024

TL;DR

This paper introduces meta-models, an architecture that interprets LLM behaviors through activations and natural language, demonstrating good generalization to out-of-distribution deceptive scenarios and aiding in faithful model interpretation.

Contribution

The paper proposes a meta-model architecture: a meta-model takes the activations of an input-model and answers natural language questions about that model's behavior, aiming for broader explanations than narrowly scoped probes provide.
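To make the data flow concrete, here is a minimal toy sketch of the idea in plain Python. It is not the authors' implementation: `input_model_activations`, `embed_question`, and `ToyMetaModel` are hypothetical names, the "activations" are seeded pseudo-random vectors standing in for a real LLM's hidden states, and the meta-model is reduced to an untrained linear read-out that answers yes/no. Its only purpose is to show the interface: activations plus a question go in, an answer comes out.

```python
import math
import random


def input_model_activations(prompt: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for an input-model's hidden activations.

    A real setup would run an LLM on the prompt and read out a hidden layer;
    here we draw a deterministic pseudo-random vector per prompt.
    """
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_question(question: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for encoding a natural language question."""
    rng = random.Random("q:" + question)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


class ToyMetaModel:
    """Untrained linear read-out over [activations ; question embedding].

    This is an assumption made for illustration only; the paper's meta-model
    is not specified on this page.
    """

    def __init__(self, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.weights = [rng.gauss(0.0, 0.1) for _ in range(2 * dim)]
        self.bias = 0.0

    def answer_probability(self, activations: list[float], question_vec: list[float]) -> float:
        # Concatenate the two inputs and apply a logistic read-out.
        features = activations + question_vec
        logit = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-logit))

    def answer(self, activations: list[float], question_vec: list[float]) -> str:
        return "yes" if self.answer_probability(activations, question_vec) >= 0.5 else "no"


if __name__ == "__main__":
    acts = input_model_activations("The capital of France is Paris.")
    question = embed_question("Is the input-model being deceptive?")
    meta_model = ToyMetaModel(dim=8)
    print(meta_model.answer(acts, question))
```

Because the weights are random and untrained, the printed answer carries no meaning; in a real system the meta-model would be trained on pairs of activations and questions with known answers.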

Findings

01. Meta-models generalize well to out-of-distribution tasks
02. Meta-models effectively interpret deceptive behaviors in LLMs
03. The approach opens new avenues for faithful model explanation

Abstract

As Large Language Models (LLMs) become increasingly integrated into our daily lives, the potential harms from deceptive behavior underscore the need to interpret their decision-making faithfully. While traditional probing methods have shown some effectiveness, they work best on narrowly scoped tasks, and more comprehensive explanations are still needed. To this end, we investigate meta-models: an architecture in which a "meta-model" takes activations from an "input-model" and answers natural language questions about the input-model's behaviors. We evaluate the meta-models' ability to generalize by training them on selected task types and assessing their out-of-distribution performance in deceptive scenarios. Our findings show that meta-models generalize well to out-of-distribution tasks and point towards opportunities for future research in this area. Our code is available at…
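The evaluation protocol in the abstract (train on selected task types, then test on held-out deceptive scenarios) can be sketched as a split keyed on task type. The snippet below is an illustrative assumption, not the paper's code: the task-type names, the example records, and the helpers `split_by_task_type` and `accuracy` are all hypothetical, and the records are placeholder data rather than anything from the paper.

```python
def split_by_task_type(examples, held_out_types):
    """Separate examples into in-distribution training data and an
    out-of-distribution test set, based on each example's task type."""
    train, ood = [], []
    for example in examples:
        if example["task_type"] in held_out_types:
            ood.append(example)
        else:
            train.append(example)
    return train, ood


def accuracy(predict, examples):
    """Fraction of examples whose predicted answer matches the label."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(predict(example) == example["label"] for example in examples)
    return correct / len(examples)


# Placeholder records (hypothetical task types and labels, for illustration only).
examples = [
    {"task_type": "sentiment", "question": "Is the reply positive?", "label": "yes"},
    {"task_type": "sentiment", "question": "Is the reply positive?", "label": "no"},
    {"task_type": "language_id", "question": "Is the reply in French?", "label": "no"},
    {"task_type": "deception", "question": "Is the model being deceptive?", "label": "yes"},
    {"task_type": "deception", "question": "Is the model being deceptive?", "label": "no"},
]

# Deception never appears in training, so it acts as the out-of-distribution test.
train_set, ood_set = split_by_task_type(examples, held_out_types={"deception"})


def always_yes(example):
    """Trivial baseline predictor, used here only to exercise `accuracy`."""
    return "yes"


print(len(train_set), len(ood_set), accuracy(always_yes, ood_set))
```

In an actual experiment, `predict` would be a trained meta-model reading the input-model's activations for each example. Keeping the held-out task type entirely out of `train_set` is what makes the resulting score a measure of out-of-distribution generalization.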

Code & Models

Repositories

acostarelli/meta-models-public
PyTorch · Official

Taxonomy

Topics: Digital Rights Management and Security