ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Tong Zhao, Chenghao Zhang, Yutao Zhu, Zhicheng Dou

TL;DR
This paper introduces the ATIR task and benchmark for audio-text interleaved retrieval, evaluates models on it, and proposes a novel token compression method to improve multimodal retrieval performance.
Contribution
The work defines a new interleaved audio-text retrieval task, creates a comprehensive benchmark, and proposes a novel token compression technique for multimodal models.
Findings
The ATIR model outperforms strong baselines in the reported experiments.
The benchmark unifies four types of contextual retrieval tasks.
Token compression alleviates the problem of excessive audio tokens in multimodal large language models (MLLMs).
Abstract
Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on images, largely overlooking audio, especially in the setting of interleaved audio-text contextual retrieval. In this work, we introduce the Audio-Text Interleaved contextual Retrieval (ATIR) task, where queries can alternate between audio and text modalities. We construct an ATIR benchmark by integrating several Automatic Speech Recognition (ASR), QA, and retrieval datasets, ultimately unifying four types of contextual retrieval tasks. This benchmark substantially addresses the limitations of existing audio retrieval datasets in semantic retrieval. To study this task, we evaluate several off-the-shelf retrievers and…
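To make the task setup concrete, the sketch below models an interleaved query as an ordered list of text and audio segments, and shortens the audio segments by averaging fixed-size windows of token embeddings. This is an illustration only: the class names `TextSegment` and `AudioSegment`, the helper functions, and the mean-pooling scheme are all assumptions made for this example. The abstract does not describe the paper's compression method, and this sketch should not be read as that method.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class AudioSegment:
    # Hypothetical representation: one embedding vector per audio token,
    # as some audio encoder might produce.
    token_embeddings: List[List[float]]


Segment = Union[TextSegment, AudioSegment]


def mean_pool_tokens(tokens: List[List[float]], window: int) -> List[List[float]]:
    """Average each run of `window` consecutive token vectors into one vector."""
    if window < 1:
        raise ValueError("window must be >= 1")
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]  # the last chunk may be shorter
        dim = len(chunk[0])
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return pooled


def compress_query(query: List[Segment], window: int) -> List[Segment]:
    """Pool audio segments, leave text untouched, and keep the interleaved order."""
    out: List[Segment] = []
    for seg in query:
        if isinstance(seg, AudioSegment):
            out.append(AudioSegment(mean_pool_tokens(seg.token_embeddings, window)))
        else:
            out.append(seg)
    return out


# A toy interleaved query: text, then audio, then text again.
query: List[Segment] = [
    TextSegment("Which passage answers the question asked in this clip?"),
    AudioSegment([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]),
    TextSegment("Prefer passages from the same speaker."),
]

compressed = compress_query(query, window=2)
print(len(compressed[1].token_embeddings))  # 5 audio tokens pooled into 3
```

With a window of 2, the five toy audio tokens become three, while the two text segments and the segment order are unchanged. A real system would trade off the window size against how much acoustic detail the retriever needs.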
