Unified Multimodal Interleaved Document Representation for Retrieval

Jaewoo Lee; Joonho Ko; Jinheon Baek; Soyeong Jeong; Sung Ju Hwang

arXiv:2410.02729·cs.CL·January 26, 2026

TL;DR

This paper introduces a unified approach for embedding documents that interleave text, images, and tables, improving retrieval performance by capturing the overall document context and the interactions across modalities.

Contribution

It proposes a novel interleaved multimodal document embedding method that integrates multiple modalities and merges segmented passages into a single representation for better retrieval.

Findings

1. Significantly outperforms baseline methods in diverse IR scenarios.
2. Effectively captures interactions between text, images, and tables.
3. Improves retrieval accuracy by considering full document context.

Abstract

Information Retrieval (IR) methods, which aim to identify documents relevant to a query, have been widely applied in various natural language tasks. However, existing approaches typically consider only the textual content within documents, overlooking the fact that documents can contain multiple modalities, including images and tables. In addition, they often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and the interactions between paragraphs. To address these two challenges, we propose a method that holistically embeds documents interleaved with multiple modalities by leveraging recent vision-language models, which can process text, images, and tables and integrate them into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents…
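To make the idea of a single document-level representation concrete, here is a minimal sketch of scoring documents against a query after merging their passage embeddings. It is not the paper's method: mean pooling is swapped in as the simplest possible merge, the vectors are toy values rather than vision-language model outputs, and the names `mean_pool`, `cosine`, and `rank_documents` are hypothetical.

```python
import math


def mean_pool(passage_vectors):
    """Average passage vectors element-wise into one document vector.

    Assumption: mean pooling is a stand-in for whatever merging
    strategy the paper actually uses.
    """
    n = len(passage_vectors)
    dim = len(passage_vectors[0])
    return [sum(v[i] for v in passage_vectors) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, docs):
    """Rank documents by similarity between the query and the pooled vector.

    docs maps a document id to the list of its passage vectors.
    """
    scored = [
        (doc_id, cosine(query_vector, mean_pool(passages)))
        for doc_id, passages in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): two documents, two passage vectors each.
docs = {
    "doc_a": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "doc_b": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
}
query = [0.9, 0.1, 0.0]
ranking = rank_documents(query, docs)
```

Each document is scored once against its pooled vector instead of once per passage, which is the contrast the abstract draws with passage-level retrieval.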

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Text and Document Classification Technologies