Towards Text-Image Interleaved Retrieval

Xin Zhang; Ziqi Dai; Yongqi Li; Yanzhao Zhang; Dingkun Long; Pengjun Xie; Meishan Zhang; Jun Yu; Wenjie Li; Min Zhang

arXiv:2502.12799·cs.CL·February 19, 2025

Open Access · 1 Repo · 1 Dataset · 1 Video

TL;DR

This paper introduces the task of text-image interleaved retrieval (TIIR), builds a benchmark from wikiHow tutorials, and proposes the Matryoshka Multimodal Embedder (MME), a model that handles interleaved multimodal content for retrieval.

Contribution

The paper defines the TIIR task, constructs a benchmark dataset, and develops the Matryoshka Multimodal Embedder (MME) to efficiently process interleaved text-image sequences.

Findings

01. Existing models are ineffective on TIIR.

02. MME significantly reduces visual tokens and improves retrieval performance.

03. Extensive analysis and dataset release support future research.

Abstract

Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics from the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline using an interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities, to address the challenge of…
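The abstract describes MME as compressing the number of visual tokens at different granularity. A minimal sketch of that general idea follows, assuming the compression is non-overlapping average pooling over a square grid of image token vectors; the abstract does not state the actual operator. The names `pool_visual_tokens` and `matryoshka_levels`, the window sizes, and the toy 4x4 grid are all hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def pool_visual_tokens(grid: List[List[Vector]], window: int) -> List[Vector]:
    """Average-pool a square grid of token vectors with a non-overlapping window.

    Assumption: average pooling stands in for whatever compression MME uses.
    """
    side = len(grid)
    if side == 0 or any(len(row) != side for row in grid):
        raise ValueError("grid must be a non-empty square")
    if window <= 0 or side % window != 0:
        raise ValueError("window must evenly divide the grid side")
    dim = len(grid[0][0])
    pooled: List[Vector] = []
    for r0 in range(0, side, window):
        for c0 in range(0, side, window):
            acc = [0.0] * dim
            # Sum every token vector inside the current window.
            for r in range(r0, r0 + window):
                for c in range(c0, c0 + window):
                    for d in range(dim):
                        acc[d] += grid[r][c][d]
            pooled.append([x / (window * window) for x in acc])
    return pooled


def matryoshka_levels(
    grid: List[List[Vector]], windows: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[Vector]]:
    """Return one token list per window size, from finest to coarsest."""
    return {w: pool_visual_tokens(grid, w) for w in windows}


# Toy 4x4 grid of 2-dimensional tokens; each token stores its (row, col).
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
levels = matryoshka_levels(grid)
token_counts = {w: len(tokens) for w, tokens in levels.items()}
```

In this toy setup, larger windows yield fewer visual tokens per image, which illustrates the trade-off between sequence length and visual detail that a nested, coarse-to-fine scheme exposes.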

Code & Models

Repositories

vec-ai/wikihow-tiir · Official

Datasets

vec-ai/wikiHow-TIIR
dataset · 74 downloads

Videos

Towards Text-Image Interleaved Retrieval · underline

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
