SAE-V: Interpreting Multimodal Models for Enhanced Alignment

Hantao Lou; Changye Li; Jiaming Ji; Yaodong Yang

arXiv:2502.17514 · cs.LG · June 18, 2025

TL;DR

SAE-V is a new interpretability framework for multimodal large language models that improves understanding of their internal mechanisms and enhances alignment quality through cross-modal feature analysis and data filtering.

Contribution

SAE-V extends Sparse Autoencoders to multimodal models, enabling detailed interpretation of cross-modal features and intrinsic data filtering for better alignment.
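
For readers unfamiliar with the Sparse Autoencoders mentioned above, the following is a minimal sketch of a generic SAE forward pass and training loss (ReLU encoder, linear decoder, reconstruction error plus an L1 sparsity penalty). It is not SAE-V's implementation: the function names, the tiny dimensions, the random initialization, and the `l1_coeff` value are all illustrative assumptions, and the cross-modal weighting and data filtering steps are not shown.

```python
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation vector into sparse features, then reconstruct it."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [a + b for a, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


def init_matrix(rows, cols, rng):
    """Small random weights; a real SAE would be trained, not left random."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: a 4-dim model activation expanded to 16 SAE features.
rng = random.Random(0)
d_model, d_sae = 4, 16
W_enc = init_matrix(d_sae, d_model, rng)
W_dec = init_matrix(d_model, d_sae, rng)
b_enc = [0.0] * d_sae
b_dec = [0.0] * d_model

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]  # stand-in for a hidden state
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
```

In a multimodal setting the input `x` would be a hidden state taken at an image or text token position; how SAE-V uses the resulting features across modalities is described in the paper itself, not in this sketch.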

Findings

01. SAE-V-based data filtering reaches more than 110% of baseline performance with less than 50% of the data.

02. SAE-V provides fine-grained interpretability of model behavior and data quality.

03. SAE-V enhances alignment stability and interpretability in MLLMs.

Abstract

With the integration of the image modality, the semantic space of multimodal large language models (MLLMs) is more complex than that of text-only models. This makes their interpretability more challenging and their alignment less stable: they are particularly susceptible to low-quality data, which can lead to inconsistencies between modalities, hallucinations, and biased outputs. As a result, developing interpretability methods for MLLMs is crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying…

Videos

SAE-V: Interpreting Multimodal Models for Enhanced Alignment · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need