Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue

TL;DR
The paper introduces Vision Search Assistant, a framework that combines vision-language models with web agents to improve open-world visual question answering, especially for unseen objects, by leveraging real-time web information.
Contribution
It presents a novel collaboration framework between VLMs and web agents for open-world retrieval-augmented generation, enhancing VLMs' ability to handle novel visual content.
Findings
Significantly outperforms existing models on open-set QA benchmarks.
Effectively integrates visual and textual data for better understanding.
Can be applied to various existing VLMs, improving their accuracy.
Abstract
Search engines enable the retrieval of unknown information with texts. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user's question regarding that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical due to heavy computational burdens. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages VLMs' visual understanding capabilities and web agents' real-time information access to perform open-world…
