BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari

TL;DR
BRAVE enhances vision-language models by integrating multiple visual encodings to improve performance and address limitations like visual hallucination, achieving state-of-the-art results with fewer trainable parameters.
Contribution
The paper introduces BRAVE, a novel method that consolidates features from multiple frozen encoders into a versatile representation for VLMs, improving performance and robustness.
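To make the idea of consolidating several frozen encoders into one representation concrete, here is a minimal sketch, not the paper's implementation. It assumes one common way to do this: project each encoder's tokens to a shared width, concatenate them, and let a small set of learnable queries cross-attend over the result. The function `consolidate`, the random "encoder outputs", and all sizes are hypothetical and chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for real weights or features (assumption: random values)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def consolidate(encoder_outputs, projections, queries):
    """Fuse token features from several frozen encoders into len(queries) vectors.

    encoder_outputs: list of (tokens_i x dim_i) matrices, one per frozen encoder.
    projections: list of (dim_i x d) matrices mapping each encoder to width d.
    queries: (num_queries x d) matrix of learnable query vectors (hypothetical).
    """
    # Project every encoder's tokens to the shared width, then concatenate
    # along the token axis so the queries can see all encoders at once.
    tokens = []
    for feats, proj in zip(encoder_outputs, projections):
        tokens.extend(matmul(feats, proj))
    d = len(queries[0])
    # Scaled dot-product cross-attention: queries attend over all tokens.
    keys_t = [list(col) for col in zip(*tokens)]  # d x total_tokens
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, tokens), weights


rng = random.Random(0)
# Two hypothetical frozen encoders with different token counts and widths.
enc_a = random_matrix(4, 6, rng)  # 4 tokens, width 6
enc_b = random_matrix(9, 3, rng)  # 9 tokens, width 3
projs = [random_matrix(6, 8, rng), random_matrix(3, 8, rng)]
queries = random_matrix(5, 8, rng)  # 5 output vectors of width 8

fused, weights = consolidate([enc_a, enc_b], projs, queries)
print(len(fused), len(fused[0]))
```

In this sketch the output length depends only on the number of queries, not on how many encoders or tokens are fed in, which is one way a fused representation can stay compact. Only the projections and queries would be trained; the encoders themselves stay frozen.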
Findings
BRAVE outperforms existing methods on captioning and VQA benchmarks.
Using multiple encoders broadens visual understanding and reduces hallucinations.
BRAVE requires fewer trainable parameters and offers a more compressed representation.
Abstract
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile…
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
