Efficient and High-Fidelity Omni Modality Retrieval

Chuong Huynh; Manh Luong; Abhinav Shrivastava

arXiv:2603.02098·cs.IR·March 25, 2026

TL;DR

OmniRet is a retrieval model that handles complex, composed queries spanning text, vision, and audio. It targets two challenges in multimodal retrieval: computational efficiency and representation fidelity.

Contribution

This paper introduces OmniRet, which the authors present as the first retrieval model to handle composed queries across three modalities (text, vision, and audio), using attention-based resampling and pooling techniques.

Findings

01 Significant improvements on composed-query, audio, and video retrieval tasks.

02 On-par performance with state-of-the-art models on other tasks.

03 A newly curated Audio-Centric Multimodal Benchmark (ACM).

Abstract

Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an…
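The efficiency concern in the abstract (long token sequences from modality-specific encoders) and the attention-based resampling named under Contribution can be illustrated with a generic sketch. This is not OmniRet's implementation, which this page does not describe. It is a minimal learned-query cross-attention resampler in plain Python, under stated assumptions: the names `attention_resample`, `tokens`, and `queries` are hypothetical, random vectors stand in for encoder outputs and for learned query parameters, and there are no learned key/value projections.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_resample(tokens, queries):
    """Compress a long token sequence into len(queries) vectors.

    Each query attends over all tokens (scaled dot-product attention)
    and returns the attention-weighted average of the token vectors.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    resampled = []
    for query in queries:
        scores = [scale * sum(q * t for q, t in zip(query, token))
                  for token in tokens]
        weights = softmax(scores)
        resampled.append([
            sum(w * token[d] for w, token in zip(weights, tokens))
            for d in range(dim)
        ])
    return resampled


# Assumed toy sizes: 200 encoder tokens reduced to 4 summary vectors.
random.seed(0)
dim, num_tokens, num_queries = 8, 200, 4
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

resampled = attention_resample(tokens, queries)
print(len(tokens), "tokens ->", len(resampled), "resampled vectors")
```

The point of the sketch is only the shape change: the downstream model would see `num_queries` vectors instead of `num_tokens`, regardless of how long the encoder output is. In a trained system the queries would be learned parameters rather than random draws.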

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech Recognition and Synthesis