SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang

TL;DR
SAE-V is a new interpretability framework for multimodal large language models that improves understanding of their internal mechanisms and enhances alignment quality through cross-modal feature analysis and data filtering.
Contribution
SAE-V extends Sparse Autoencoders to multimodal models, enabling detailed interpretation of cross-modal features and intrinsic data filtering for better alignment.
Findings
SAE-V-based data filtering achieves more than 110% performance with less than 50% of the data.
SAE-V provides fine-grained interpretability of model behavior and data quality.
SAE-V enhances alignment stability and interpretability in MLLMs.
Abstract
With the integration of the image modality, the semantic space of multimodal large language models (MLLMs) is more complex than that of text-only models, which makes them harder to interpret and their alignment less stable. Alignment is particularly susceptible to low-quality data, which can lead to inconsistencies between modalities, hallucinations, and biased outputs. As a result, developing interpretability methods for MLLMs is crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying…
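For readers unfamiliar with the SAE paradigm the abstract refers to, the following is a minimal sketch of a generic sparse autoencoder applied to a single activation vector. It is not the SAE-V architecture: the class name, dimensions, random initialization, and L1 coefficient are all illustrative assumptions, and no training loop is shown. The idea is to encode a model activation into a wider, mostly-zero feature vector and decode it back, with a loss that trades reconstruction error against sparsity.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class SparseAutoencoder:
    """Toy SAE (hypothetical, for illustration only):
    f = ReLU(W_enc x + b_enc), x_hat = W_dec f + b_dec."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        # Small random weights; a real SAE would learn these by training.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_hidden)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # ReLU keeps feature activations non-negative and many exactly zero.
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        out = matvec(self.W_dec, f)
        return [o + b for o, b in zip(out, self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        # Squared reconstruction error plus an L1 penalty on the features.
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(abs(a) for a in f)
        return recon + l1_coeff * sparsity


# Assumed sizes: an 8-dimensional activation expanded to 32 features.
sae = SparseAutoencoder(d_model=8, d_hidden=32)
rng = random.Random(1)
activation = [rng.gauss(0, 1) for _ in range(8)]

features = sae.encode(activation)
active = sum(1 for f in features if f > 0)
print("active features:", active, "of", len(features))
print("loss:", sae.loss(activation))
```

In an interpretability workflow, the individual entries of `features` are what one would inspect, for example by looking at which inputs activate a given feature most strongly. How SAE-V handles image and text tokens jointly is not described in the portion of the abstract available here.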
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
Methods: Softmax · Attention Is All You Need
