BRAVE: Broadening the visual encoding of vision-language models

Oğuzhan Fatih Kar; Alessio Tonioni; Petra Poklukar; Achin Kulshrestha; Amir Zamir; Federico Tombari

arXiv:2404.07204·cs.CV·April 11, 2024·2 cites

Open Access

TL;DR

BRAVE enhances vision-language models by integrating multiple visual encodings to improve performance and address limitations like visual hallucination, achieving state-of-the-art results with fewer trainable parameters.

Contribution

The paper introduces BRAVE, a novel method that consolidates features from multiple frozen encoders into a versatile representation for VLMs, improving performance and robustness.
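
As a rough illustration of what "consolidating features from multiple frozen encoders" can look like, here is a minimal sketch in plain Python. It is not the paper's implementation: this summary does not describe the consolidation mechanism, so the sketch assumes one plausible design in which a small, fixed set of query vectors attends over the concatenated encoder tokens. All names (`make_frozen_encoder`, `consolidate`, the dimensions, the random "encoders") are hypothetical stand-ins.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def random_matrix(rng, rows, cols):
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def make_frozen_encoder(seed, image_dim, feat_dim, num_tokens):
    """Hypothetical stand-in for a frozen vision encoder: fixed random linear maps."""
    rng = random.Random(seed)
    token_maps = [random_matrix(rng, feat_dim, image_dim) for _ in range(num_tokens)]

    def encode(image):
        # One feature vector ("token") per fixed map; the weights never change.
        return [matvec(token_map, image) for token_map in token_maps]

    return encode


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def consolidate(image, encoders, projections, queries):
    """Assumed design: queries cross-attend over tokens from all encoders."""
    # 1. Run every frozen encoder and project its tokens to a shared width.
    tokens = []
    for encode, projection in zip(encoders, projections):
        for token in encode(image):
            tokens.append(matvec(projection, token))
    # 2. Each query takes an attention-weighted average of all tokens.
    shared_dim = len(queries[0])
    scale = 1.0 / math.sqrt(shared_dim)
    outputs = []
    for query in queries:
        weights = softmax([scale * dot(query, token) for token in tokens])
        outputs.append(
            [sum(w * token[d] for w, token in zip(weights, tokens)) for d in range(shared_dim)]
        )
    return outputs


rng = random.Random(0)
IMAGE_DIM, SHARED_DIM, NUM_QUERIES = 16, 8, 4
encoder_specs = [(1, 6, 5), (2, 10, 3)]  # (seed, feature width, token count), all made up
encoders = [make_frozen_encoder(s, IMAGE_DIM, fd, nt) for s, fd, nt in encoder_specs]
# In this sketch only the projections and queries would be trained.
projections = [random_matrix(rng, SHARED_DIM, fd) for _, fd, _ in encoder_specs]
queries = random_matrix(rng, NUM_QUERIES, SHARED_DIM)
image = [rng.random() for _ in range(IMAGE_DIM)]

visual_tokens = consolidate(image, encoders, projections, queries)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 8
```

Under these assumptions the output always has `NUM_QUERIES` vectors of width `SHARED_DIM`, however many encoders or tokens feed in, which is one way a multi-encoder representation can stay compact for the language model.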

Findings

01. BRAVE outperforms existing methods on captioning and VQA benchmarks.

02. Using multiple encoders broadens visual understanding and reduces hallucinations.

03. BRAVE requires fewer trainable parameters than existing methods and offers a more compressed representation.

Abstract

Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile…

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training