Efficient and High-Fidelity Omni Modality Retrieval
Chuong Huynh, Manh Luong, Abhinav Shrivastava

TL;DR
OmniRet is a retrieval model that handles complex, composed queries across text, vision, and audio, targeting two challenges in multimodal retrieval: computational efficiency and representation fidelity.
Contribution
This paper introduces OmniRet, which the authors present as the first retrieval model to handle composed queries spanning three modalities (text, vision, and audio), using attention-based resampling and pooling techniques.
Findings
Significant improvements on composed-query, audio, and video retrieval tasks.
Performance on par with state-of-the-art models on other tasks.
A newly curated Audio-Centric Multimodal Benchmark (ACM).
Abstract
Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an…
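The efficiency concern raised in the abstract, long token sequences from modality-specific encoders, is often handled by compressing tokens with cross-attention before they reach the LLM. The sketch below illustrates that general idea, together with a simple attention-pooling step, in NumPy. It is not the authors' implementation: the function names `attention_resample` and `attention_pool`, the single-head design, the random weights, and all dimensions are assumptions chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_resample(tokens, latents, w_q, w_k, w_v):
    """Compress (n_tokens, d) encoder tokens into (n_latents, d) outputs.

    Hypothetical single-head cross-attention: a small set of latent
    queries attends over the full token sequence.
    """
    q = latents @ w_q                          # (n_latents, d)
    k = tokens @ w_k                           # (n_tokens, d)
    v = tokens @ w_v                           # (n_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_latents, n_tokens)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

def attention_pool(tokens, pool_query):
    """Reduce (n, d) tokens to one (d,) vector with attention weights."""
    scores = tokens @ pool_query / np.sqrt(tokens.shape[-1])  # (n,)
    weights = softmax(scores, axis=-1)
    return weights @ tokens                    # (d,)

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
d, n_tokens, n_latents = 64, 1024, 16
tokens = rng.standard_normal((n_tokens, d))    # stand-in for encoder output
latents = rng.standard_normal((n_latents, d))  # learned in a real model
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

resampled, weights = attention_resample(tokens, latents, w_q, w_k, w_v)
embedding = attention_pool(resampled, rng.standard_normal(d))
print(resampled.shape, embedding.shape)        # (16, 64) (64,)
```

In this sketch the downstream model would see 16 vectors instead of 1024, and the pooled vector stands in for a single retrieval embedding. In a trained system the latent queries and projection matrices would be learned rather than sampled at random.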
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech Recognition and Synthesis
