Unifying Multimodal Retrieval via Document Screenshot Embedding

Xueguang Ma; Sheng-Chieh Lin; Minghan Li; Wenhu Chen; Jimmy Lin

arXiv:2406.11251 · cs.IR · December 3, 2024 · 1 citation

Open Access · 10 Models · 1 Dataset

TL;DR

This paper introduces Document Screenshot Embedding (DSE), a new retrieval method that directly encodes document screenshots into dense vectors, eliminating the need for content extraction and effectively handling diverse document formats.

Contribution

The paper proposes DSE, a novel screenshot-based retrieval approach using vision-language models, and creates Wiki-SS, a large dataset for evaluating multimodal document retrieval.

Findings

01 On Wiki-SS, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy.

02 In slide retrieval, DSE exceeds OCR-based text retrieval methods by over 15 points in nDCG@10.

03 DSE handles diverse document modalities without content parsing.
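
The two metrics cited in the findings can be illustrated with a small sketch. This is not the paper's evaluation code: the function names, the toy rankings, and the choice of linear gains with a log2 position discount are all assumptions made for illustration. "Points" in the findings presumably means percentage points on a 0 to 100 scale, while the functions here return values between 0 and 1.

```python
import math

def top1_accuracy(rankings, relevant):
    """Fraction of queries whose first-ranked document is relevant."""
    hits = sum(
        1 for query, ranked in rankings.items()
        if ranked and ranked[0] in relevant[query]
    )
    return hits / len(rankings)

def ndcg_at_k(ranked, gains, k=10):
    """nDCG@k with linear gains and a log2 position discount (one common variant)."""
    dcg = sum(
        gains.get(doc, 0) / math.log2(pos + 2)
        for pos, doc in enumerate(ranked[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Hypothetical rankings for two queries (document ids are made up).
rankings = {"q1": ["d1", "d2"], "q2": ["d3", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d4"}}

accuracy = top1_accuracy(rankings, relevant)
ndcg_q2 = ndcg_at_k(rankings["q2"], {"d4": 1}, k=10)
```

In this toy setup only `q1` has a relevant document at rank 1, and `q2` is penalized in nDCG because its single relevant document appears at rank 2 rather than rank 1.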

Abstract

In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, error-prone, and loses information. To address this, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that treats document screenshots as a unified input format, requires no content-extraction preprocessing, and preserves all the information in a document (e.g., text, images, and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval. To evaluate our method, we first construct Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset. In such a…
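
The retrieval step described in the abstract, where screenshots are encoded into dense vectors and then matched against a query vector, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vision-language encoder is omitted, the vectors are made-up stand-ins for its outputs, and the use of cosine similarity as the scoring function is an assumption. The names `cosine` and `search` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

# Toy vectors standing in for encoder outputs (hypothetical values).
# In a real system each vector would come from encoding one screenshot.
doc_vecs = {
    "page_a": [0.9, 0.1, 0.0],
    "page_b": [0.1, 0.8, 0.1],
    "page_c": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.0]

top = search(query_vec, doc_vecs, k=2)
```

Because documents are encoded once and stored as vectors, only the query needs to be encoded at search time. A corpus of the size mentioned above would typically use an approximate nearest-neighbor index instead of the exhaustive sort shown here.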

Code & Models

Datasets

Tevatron/docmatix-ir
dataset · 43 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies