Towards Text-Image Interleaved Retrieval
Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang

TL;DR
This paper introduces the novel task of text-image interleaved retrieval (TIIR), builds a benchmark from wikiHow tutorials, and proposes a new model, MME, that handles interleaved multimodal content for improved retrieval.
Contribution
The paper defines the TIIR task, constructs a benchmark dataset, and develops the Matryoshka Multimodal Embedder (MME) to efficiently process interleaved text-image sequences.
Findings
Existing models are ineffective on TIIR.
MME significantly reduces visual tokens and improves retrieval performance.
Extensive analysis and dataset release support future research.
Abstract
Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics from the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline by interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity, to address the challenge of…
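To make the idea of compressing visual tokens "at different granularity" concrete, here is a minimal sketch that reduces a square grid of visual tokens to nested, progressively coarser sets. It is an illustration only: the use of average pooling, the grid sizes, and the names `pool_visual_tokens` and `matryoshka_levels` are assumptions made for this example and are not taken from the paper, whose actual compression mechanism is not described in this summary.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def pool_visual_tokens(tokens: List[Vector], grid: int, target: int) -> List[Vector]:
    """Average-pool a grid x grid sheet of tokens down to target x target.

    Assumption for this sketch: tokens are stored row by row and are
    compressed by plain average pooling over non-overlapping blocks.
    """
    if len(tokens) != grid * grid:
        raise ValueError("expected grid * grid tokens")
    if target <= 0 or grid % target != 0:
        raise ValueError("target must evenly divide grid")

    step = grid // target          # side length of each pooled block
    dim = len(tokens[0])           # embedding dimension of one token
    pooled: List[Vector] = []
    for block_row in range(target):
        for block_col in range(target):
            acc = [0.0] * dim
            for r in range(block_row * step, (block_row + 1) * step):
                for c in range(block_col * step, (block_col + 1) * step):
                    tok = tokens[r * grid + c]
                    for d in range(dim):
                        acc[d] += tok[d]
            pooled.append([v / (step * step) for v in acc])
    return pooled


def matryoshka_levels(
    tokens: List[Vector], grid: int, targets: Sequence[int] = (4, 2, 1)
) -> Dict[int, List[Vector]]:
    """Return the same image at several token budgets, keyed by token count."""
    return {t * t: pool_visual_tokens(tokens, grid, t) for t in targets}


# Toy input: a 4 x 4 grid of 2-dimensional tokens, token i = [i, 1.0].
tokens = [[float(i), 1.0] for i in range(16)]
levels = matryoshka_levels(tokens, grid=4)
```

In this toy setup, `levels` holds the same image at 16, 4, and 1 tokens, so a retriever could choose a smaller budget per image when a query or document contains many images.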
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
