Unifying Multimodal Retrieval via Document Screenshot Embedding
Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin

TL;DR
This paper introduces Document Screenshot Embedding (DSE), a new retrieval method that directly encodes document screenshots into dense vectors, eliminating the need for content extraction and effectively handling diverse document formats.
Contribution
The paper proposes DSE, a novel screenshot-based retrieval approach using vision-language models, and creates Wiki-SS, a large dataset for evaluating multimodal document retrieval.
Findings
On the text-intensive Wiki-SS retrieval task, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy.
On the mixed-modality slide retrieval task, DSE exceeds OCR-based text retrieval methods by over 15 points in nDCG@10.
DSE effectively handles diverse document modalities without content parsing.
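The two metrics named in these findings can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names, the toy rankings, and the choice of linear gain in DCG (rather than the exponential variant) are all assumptions made here.

```python
import math

def top_k_accuracy(rankings, relevant_sets, k=1):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1
        for ranking, relevant in zip(rankings, relevant_sets)
        if any(doc in relevant for doc in ranking[:k])
    )
    return hits / len(rankings)

def ndcg_at_k(ranking, gains, k=10):
    """nDCG@k for one query; `gains` maps doc id -> graded relevance.

    Assumes linear gain: DCG = sum(gain / log2(rank + 1)), ranks from 1.
    """
    dcg = sum(
        gains.get(doc, 0) / math.log2(pos + 2)  # pos is 0-based
        for pos, doc in enumerate(ranking[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical toy data: two queries, each with one relevant document.
rankings = [["a", "b"], ["c", "d"]]
relevant_sets = [{"a"}, {"d"}]
accuracy = top_k_accuracy(rankings, relevant_sets, k=1)
score = ndcg_at_k(["x", "a"], {"a": 1}, k=10)
```

Top-1 accuracy only rewards a relevant document in first place, while nDCG@10 gives partial credit for relevant documents further down the list, which is why the two findings are reported with different metrics.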
Abstract
In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, error-prone, and loses information. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that treats document screenshots as a unified input format, requires no content extraction preprocessing, and preserves all the information in a document (e.g., text, images, and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval. To evaluate our method, we first construct Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset. In such a…
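The retrieval step described in the abstract, ranking documents by comparing dense vectors, can be sketched as follows. This is an illustration only: the three-dimensional vectors are hand-written stand-ins for what a vision-language encoder would produce from screenshots, the document ids are hypothetical, and cosine similarity is assumed as the scoring function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs with the highest similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings: in a real system each vector would come from
# encoding one document screenshot; here they are toy values.
doc_vecs = {
    "page_text_heavy": [0.9, 0.1, 0.0],
    "page_with_table": [0.1, 0.9, 0.2],
    "slide_with_chart": [0.0, 0.2, 0.9],
}
query_vec = [0.1, 0.3, 0.8]  # stand-in for an encoded text query

results = search(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Because every document is reduced to a vector in the same space regardless of its original format, the search routine never needs to know whether a screenshot showed text, a table, or a chart.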
Code & Models
- 🤗 llamaindex/vdr-2b-multi-v1 (model) · 1.9k dl · ♡ 128
- 🤗 MrLight/dse-qwen2-2b-mrl-v1 (model) · 14k dl · ♡ 68
- 🤗 nomic-ai/colnomic-embed-multimodal-7b (model) · 7.4k dl · ♡ 103
- 🤗 Tevatron/dse-phi3-docmatix-v1 (model) · 15 dl · ♡ 9
- 🤗 Tevatron/dse-phi3-docmatix-v2 (model) · 13 dl · ♡ 1
- 🤗 MrLight/dse-phi35-vidore-ft (model) · 15 dl · ♡ 10
- 🤗 llamaindex/vdr-2b-v1 (model) · 121 dl · ♡ 13
- 🤗 nomic-ai/colnomic-embed-multimodal-3b (model) · 2.0k dl · ♡ 37
- 🤗 nomic-ai/nomic-embed-multimodal-3b (model) · 2.6k dl · ♡ 29
- 🤗 nomic-ai/nomic-embed-multimodal-7b (model) · 3.0k dl · ♡ 48
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies
