Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

John Yan; Michael Yu; Yuqi Sun; Alexander Duffy; Tyler Marques; Matthew Lyle Olson

arXiv:2602.05183·cs.LG·February 9, 2026

TL;DR

This paper introduces a data-centric interpretability framework that uses Sparse Autoencoders (SAEs) and LLM-summarizer methods to analyze how LLM behavior changes during multi-agent reinforcement learning training, surfacing both learned behaviors and problems such as degenerate outputs and environment-specific bugs.

Contribution

The paper presents Meta-Autointerp, a novel method for grouping SAE features into interpretable hypotheses about training dynamics, and demonstrates its effectiveness in understanding complex training behaviors and improving agent performance.

Findings

01. 90% of SAE Meta-Features are statistically significant

02. Discovered behaviors include role-playing, language switching, and reward hacking

03. Augmenting system prompts improved agent scores by 14.2%

Abstract

Large language models (LLMs) are increasingly trained in complex multi-agent reinforcement learning environments, making it difficult to understand how behavior changes over training. Sparse Autoencoders (SAEs) have recently been shown to be useful for data-centric interpretability. In this work, we analyze large-scale reinforcement learning training runs from the sophisticated environment of Full-Press Diplomacy by applying pretrained SAEs alongside LLM-summarizer methods. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine-grained behaviors, including role-playing patterns, degenerate outputs, and language switching, alongside high-level strategic behaviors and environment-specific bugs. Through automated evaluation, we validate that 90% of discovered SAE Meta-Features are significant, and find a surprising…
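The abstract does not say how a grouped feature is scored or how significance is assessed, so the following is only a minimal sketch of one generic way to check such a hypothesis, not the paper's procedure. It assumes each SAE feature has a mean activation per training checkpoint, averages a group of features into a single "meta-feature" series, and uses a permutation test on the Pearson correlation with training step. The feature IDs, activation values, group label, and all function names are hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def meta_feature_series(activations, feature_ids):
    """Average the per-checkpoint activations of the grouped features."""
    n_steps = len(activations[feature_ids[0]])
    return [mean(activations[f][t] for f in feature_ids) for t in range(n_steps)]


def permutation_p_value(steps, values, n_perm=2000, seed=0):
    """Two-sided permutation test: does activation trend with training step?"""
    rng = random.Random(seed)
    observed = abs(pearson(steps, values))
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(steps, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: mean activation of three SAE features at 8 checkpoints.
activations = {
    101: [0.1, 0.2, 0.3, 0.5, 0.6, 0.8, 0.9, 1.0],
    202: [0.0, 0.1, 0.3, 0.4, 0.6, 0.7, 0.9, 1.1],
    303: [0.4, 0.6, 0.3, 0.5, 0.4, 0.6, 0.5, 0.4],
}
steps = list(range(8))

# Hypothetical grouping of two features under one hypothesis label.
group = {"label": "language switching increases over training", "features": [101, 202]}

series = meta_feature_series(activations, group["features"])
p_value = permutation_p_value(steps, series)
print(group["label"], "p =", round(p_value, 4))
```

A meta-feature whose series rises steadily with training step yields a small p-value under this test, while a flat series yields a p-value of 1. Averaging is the simplest possible aggregation; a real pipeline might weight features or test each one separately and correct for multiple comparisons.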

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning