One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle, Chen Cecilia Liu, Jonas Pfeiffer, Iryna Gurevych

TL;DR
This paper investigates whether combining different vision encoders enhances vision and language task performance, finding that diverse encoders are complementary and can improve results beyond single-encoder approaches.
Contribution
The study analyzes whether three popular vision encoders are complementary across six downstream V+L tasks, and shows the benefit of giving a model features from several diverse encoders rather than a single one.
Findings
Diverse vision encoders complement each other in V+L tasks.
Combining multiple encoders improves performance, and the gains are not explained by simple ensembling alone.
Future VEs designed specifically for V+L tasks, rather than repurposed, could further improve performance.
Abstract
Current multimodal models, aimed at solving Vision and Language (V+L) tasks, predominantly repurpose Vision Encoders (VE) as feature extractors. While many VEs (of different architectures, trained on different data and objectives) are publicly available, they are not designed for the downstream V+L tasks. Nonetheless, most current work assumes that a single pre-trained VE can serve as a general-purpose encoder. In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary, i.e., if providing the model with features from multiple VEs can improve the performance on a target task, and how they are combined. We exhaustively experiment with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns. Our analyses suggest that diverse VEs complement each other, resulting in improved…
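To make "providing the model with features from multiple VEs" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. It assumes each encoder's token features are linearly projected to a shared width and concatenated along the token axis, and that VE-dropout means randomly dropping all features of an encoder during training. The names `project`, `combine_encoders`, `enc_a`, `enc_b`, and `drop_prob` are hypothetical.

```python
import random


def project(tokens, weight):
    """Map each token vector to the shared width (weight is in_dim x out_dim)."""
    return [
        [sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
        for tok in tokens
    ]


def combine_encoders(features, weights, drop_prob=0.0, rng=None):
    """Concatenate projected token sequences from several encoders.

    features:  dict mapping encoder name -> list of token vectors for one image
    weights:   dict mapping encoder name -> projection matrix for that encoder
    drop_prob: assumed probability of dropping an encoder entirely (training only)
    """
    rng = rng or random.Random()
    names = list(features)
    kept = [name for name in names if rng.random() >= drop_prob]
    if not kept:
        # Assumption: never feed the model an empty visual input.
        kept = [rng.choice(names)]
    combined = []
    for name in kept:
        combined.extend(project(features[name], weights[name]))
    return combined, kept


# Toy inputs: two hypothetical encoders with different token counts and widths.
features = {
    "enc_a": [[1.0, 2.0], [3.0, 4.0]],  # 2 tokens, width 2
    "enc_b": [[1.0, 0.0, 1.0]],         # 1 token, width 3
}
weights = {
    "enc_a": [[1.0, 0.0], [0.0, 1.0]],              # 2 -> 2 (identity)
    "enc_b": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # 3 -> 2
}

# Without dropout, every encoder contributes its tokens to one sequence.
combined, kept = combine_encoders(features, weights)

# With drop_prob=1.0 every encoder is dropped, so the fallback keeps exactly one.
_, kept_dropped = combine_encoders(
    features, weights, drop_prob=1.0, rng=random.Random(0)
)
```

In this sketch the downstream model sees a single token sequence, so its attention over that sequence can be inspected per encoder. Comparing results with individual encoders dropped is one way to probe how much the model relies on each.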
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
