Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval
Ze Liu, Zhengyang Liang, Junjie Zhou, Zheng Liu, Defu Lian

TL;DR
This paper introduces Visualized Information Retrieval (Vis-IR), a new paradigm using screenshots to unify multimodal data for retrieval, supported by a large dataset, a universal embedding model, and a comprehensive benchmark.
Contribution
It presents the VIRA dataset, the UniSE retrieval model, and the MVRB benchmark, advancing the field of multimodal retrieval with visualized information.
Findings
UniSE outperforms existing multimodal retrievers.
The VIRA dataset enables diverse retrieval tasks.
The MVRB benchmark facilitates comprehensive evaluation.
Abstract
With the growing popularity of multimodal techniques, there is increasing interest in acquiring useful information in visual form. In this work, we formally define an emerging IR paradigm called Visualized Information Retrieval, or Vis-IR, where multimodal information, such as text, images, tables, and charts, is jointly represented in a unified visual format called screenshots, for various retrieval applications. We further make three key contributions to Vis-IR. First, we create VIRA (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop UniSE (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct…
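As a rough illustration of what "screenshots to query or be queried" means in an embedding-based retriever, the sketch below ranks a handful of screenshot embeddings against a text-query embedding. Everything in it is an assumption made for illustration: the vectors are hand-written toy values rather than UniSE outputs, the identifiers are hypothetical, and cosine similarity is a common scoring choice that the summary above does not confirm for UniSE.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query_vec, corpus):
    """Return corpus ids sorted by descending similarity to the query.

    corpus maps an id to its embedding vector.
    """
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Toy 3-dimensional vectors standing in for screenshot embeddings (hypothetical).
corpus = {
    "screenshot_chart": [0.9, 0.1, 0.0],
    "screenshot_table": [0.1, 0.9, 0.1],
    "screenshot_article": [0.0, 0.2, 0.9],
}

# Toy vector standing in for an embedded text query (hypothetical).
text_query = [0.8, 0.2, 0.1]

ranking = rank(text_query, corpus)
print(ranking)
```

Because queries and candidates live in the same vector space, the same `rank` call would work unchanged if the query vector came from a screenshot instead of text, which is the property a universal embedding model is meant to provide.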
Taxonomy
Topics: Image Retrieval and Classification Techniques · Data Management and Algorithms
