Explaining How Visual, Textual and Multimodal Encoders Share Concepts
Clément Cornet, Romaric Besançon, Hervé Le Borgne

TL;DR
This paper introduces new tools for comparing visual, textual, and multimodal encoders based on sparse autoencoder features, revealing shared representations and the influence of text pretraining across models.
Contribution
It proposes a novel quantitative indicator for cross-modality model comparison and a measure for shared features, enabling comprehensive analysis of different encoder types.
Findings
Visual features in VLMs are shared with text encoders.
Models trained in multimodal contexts share more representations.
The new tools facilitate revisiting and understanding encoder similarities.
Abstract
Sparse autoencoders (SAEs) have emerged as a powerful technique for extracting human-interpretable features from neural network activations. Previous works compared different models based on SAE-derived features, but those comparisons have been restricted to models within the same modality. We propose a novel indicator allowing quantitative comparison of models across SAE features, and use it to conduct a comparative study of visual, textual and multimodal encoders. We also propose to quantify the Comparative Sharedness of individual features between different classes of models. With these two new tools, we conduct several studies on 21 encoders of the three types, with two significantly different sizes, and considering generalist and domain-specific datasets. The results make it possible to revisit previous studies in light of encoders trained in a multimodal context and to quantify to which…
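The reviews below refer to the proposed indicator as wMPPC. The exact definition is not given on this page, so the following is only a minimal sketch under stated assumptions: that the indicator builds on a max pairwise Pearson correlation between the SAE features of two models, computed over the same inputs, and that the weighting favors features that are more active. The function names (`pairwise_pearson`, `mppc`, `weighted_mppc`) and the default weights (total absolute activation per feature) are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def pairwise_pearson(a, b, eps=1e-8):
    """Pearson correlation between every column of `a` and every column of `b`.

    a: (n_samples, n_features_a) SAE activations of model A
    b: (n_samples, n_features_b) SAE activations of model B on the same inputs
    Returns an (n_features_a, n_features_b) matrix.
    """
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Dead (constant) features have zero norm; eps keeps their correlation at 0.
    a = a / (np.linalg.norm(a, axis=0) + eps)
    b = b / (np.linalg.norm(b, axis=0) + eps)
    return a.T @ b

def mppc(a, b):
    """Unweighted mean, over features of A, of the best correlation found in B."""
    return float(pairwise_pearson(a, b).max(axis=1).mean())

def weighted_mppc(a, b, weights=None):
    """Weighted variant. The default weights are an assumption of this sketch."""
    max_corr = pairwise_pearson(a, b).max(axis=1)
    if weights is None:
        weights = np.abs(a).sum(axis=0)  # assumed: total activation mass per feature
    return float(np.average(max_corr, weights=weights))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts_a = np.maximum(rng.normal(size=(200, 5)), 0)  # toy non-negative activations
    acts_b = acts_a[:, ::-1].copy()                    # same features, reordered
    # Add one dead feature to A: it drags the unweighted mean down
    # but receives zero weight in the weighted variant.
    acts_a_dead = np.concatenate([acts_a, np.zeros((200, 1))], axis=1)
    print(mppc(acts_a_dead, acts_b), weighted_mppc(acts_a_dead, acts_b))
```

In this toy setup the weighting keeps a never-active feature from lowering the score, which is one plausible motivation for weighting. Whether the paper uses this particular weighting is not stated here.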
Peer Reviews
Decision·Submitted to ICLR 2026
Two new metrics are proposed that help to understand the similarities and differences between models. The authors show how these metrics can be used to uncover interesting details like the quality of the original corpora or "shared concepts" learned between models. The paper provides clear details and pointers to scripts for reproducing results, and it works with public datasets and models, so it should be highly reproducible.
The paper provides a comparative study of visual, textual and joint vision-text models. It would be super interesting to see what insights these measures could provide with the addition of audio to the assessed modalities.
* Analyzing similarities and differences across visual, textual, and multimodal encoders is valuable, as it can inform how future models are trained and aligned.
* The study spans a large and diverse set of 21 transformer encoders, offering broad coverage across modalities, datasets, and scales.
* The paper includes a detailed limitation section, showing good awareness of scope boundaries and possible extensions.
* I found it a bit surprising that CLIP image features are more correlated with DINOv2 image features than with SigLIP image features (SigLIP is trained similarly to CLIP), Tab. 1. The same goes for SigLIP image features being more correlated with CLIP and BERT text than with the SigLIP text encoder. This makes me question the proposed metrics.
* It is not clear how the observed correlations translate to real-world impact (measured with quantitative metrics), for example, whether they relate to model performance, bias, or hallucination behavior.
1. Extensive experiments have been conducted: within-modality and cross-modality comparisons on popular models and multiple datasets.
2. A qualitative analysis to identify underlying concepts has been done.
My main concern is that all the numbers reported are in terms of the proposed metric wMPPC. While I understand intuitively why the use of weighting is important, I think there needs to be a comparison between wMPPC and older metrics, with examples that show the need for proposing wMPPC. Detailed questions and clarifications are listed in the Questions section.
Taxonomy
Topics: Language, Metaphor, and Cognition · Natural Language Processing Techniques
