Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Zhixin Zhang; Yiyuan Zhang; Xiaohan Ding; Xiangyu Yue

arXiv:2410.21220·cs.CV·October 29, 2024

TL;DR

The paper introduces Vision Search Assistant, a framework that combines vision-language models with web agents to improve open-world visual question answering, especially for unseen objects, by leveraging real-time web information.

Contribution

It presents a novel collaboration framework between VLMs and web agents for open-world retrieval-augmented generation, enhancing VLMs' ability to handle novel visual content.

Findings

01 Significantly outperforms existing models on open-set QA benchmarks.

02 Effectively integrates visual and textual data for better understanding.

03 Applicable to various existing VLMs with improved accuracy.

Abstract

Search engines enable the retrieval of unknown information with texts. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user's question regarding that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical due to heavy computational burdens. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages VLMs' visual understanding capabilities and web agents' real-time information access to perform open-world…
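
The collaboration described in the abstract can be pictured as a small retrieval-augmented loop: the VLM turns the image into text, a web agent fetches current information, and the VLM answers with that information in its prompt. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `Retrieved`, `answer_with_web_search`, `describe_image`, `search_web`, and `generate`, the prompt wording, and the `max_results` parameter are all hypothetical; the caller supplies the real VLM and search functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Retrieved:
    """One web result. Hypothetical container, not the paper's data structure."""
    url: str
    text: str


def answer_with_web_search(
    image: bytes,
    question: str,
    describe_image: Callable[[bytes, str], str],   # assumed VLM call: (image, prompt) -> description
    search_web: Callable[[str], List[Retrieved]],  # assumed web agent: query -> results
    generate: Callable[[bytes, str], str],         # assumed VLM call: (image, prompt) -> answer
    max_results: int = 3,
) -> str:
    """Answer a question about an image using retrieved web text as extra context."""
    # Step 1: the VLM converts the unfamiliar visual content into searchable text.
    description = describe_image(image, f"Describe what is relevant to: {question}")

    # Step 2: the web agent retrieves up-to-date text for that description.
    results = search_web(f"{description} {question}")[:max_results]

    # Step 3: the VLM answers with the retrieved snippets placed in its prompt.
    context = "\n".join(
        f"[{i + 1}] {r.text} ({r.url})" for i, r in enumerate(results)
    )
    prompt = f"Web context:\n{context}\n\nQuestion: {question}"
    return generate(image, prompt)
```

Passing the models in as plain callables keeps the sketch independent of any particular VLM or search API, which matches the page's claim that the approach applies to various existing VLMs.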

Code & Models

Repositories

cnzzx/vsa (PyTorch, official)

Taxonomy

Topics: Media, Religion, Digital Communication · Religious Tourism and Spaces