Smart Routing for Multimodal Video Retrieval: When to Search What

Kevin Dela Rosa

arXiv:2507.13374 · cs.CV · July 21, 2025

TL;DR

ModaRoute is an LLM-based system that routes each video retrieval query to only the modalities it needs (ASR, OCR, or visual), cutting computational overhead by 41% while achieving 60.9% Recall@5.

Contribution

We propose ModaRoute, an LLM-driven routing system that dynamically selects the most relevant modalities for each multimodal video retrieval query, reducing computational overhead relative to searching every modality.

Findings

01. Achieves 60.9% Recall@5 with reduced computational cost

02. Reduces infrastructure costs by 41% through intelligent routing

03. Routes queries across ASR, OCR, and visual indices, averaging 1.78 modalities per query versus 3.0 for exhaustive search

Abstract

We introduce ModaRoute, an LLM-based intelligent routing system that dynamically selects optimal modalities for multimodal video retrieval. While dense text captions can achieve 75.9% Recall@5, they require expensive offline processing and miss critical visual information present in 34% of clips with scene text not captured by ASR. By analyzing query intent and predicting information needs, ModaRoute reduces computational overhead by 41% while achieving 60.9% Recall@5. Our approach uses GPT-4.1 to route queries across ASR (speech), OCR (text), and visual indices, averaging 1.78 modalities per query versus exhaustive 3.0 modality search. Evaluation on 1.8M video clips demonstrates that intelligent routing provides a practical solution for scaling multimodal retrieval systems, reducing infrastructure costs while maintaining competitive effectiveness for real-world deployment.
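
To make the routing idea concrete, here is a minimal sketch of query routing over three per-modality indices. It is not the paper's implementation: the abstract describes GPT-4.1 as the router, while `keyword_router` below is a hypothetical keyword-based stand-in so the example runs offline. The cue-word lists, the `routed_search` helper, the toy indices, and the max-score merge are all assumptions made for illustration. `search_reduction` shows that averaging 1.78 of 3 modalities corresponds to roughly 41% fewer index searches, assuming cost scales with the number of indices queried.

```python
from typing import Callable, Dict, List, Set, Tuple

MODALITIES = ("asr", "ocr", "visual")

# Each index maps a query to a list of (clip_id, score) hits.
Index = Callable[[str], List[Tuple[str, float]]]


def keyword_router(query: str) -> Set[str]:
    """Hypothetical stand-in for an LLM router: choose modalities from cue words."""
    words = set(query.lower().split())
    chosen: Set[str] = set()
    if words & {"says", "said", "talking", "mentions"}:
        chosen.add("asr")      # spoken content
    if words & {"sign", "logo", "written", "subtitle"}:
        chosen.add("ocr")      # on-screen text
    if words & {"wearing", "walking", "red", "scene"} or not chosen:
        chosen.add("visual")   # appearance cues, or fallback when nothing matched
    return chosen


def routed_search(query: str,
                  router: Callable[[str], Set[str]],
                  indices: Dict[str, Index],
                  k: int = 5) -> List[str]:
    """Search only the routed indices; merge hits by best score (an assumption)."""
    best: Dict[str, float] = {}
    for modality in router(query):
        for clip_id, score in indices[modality](query):
            best[clip_id] = max(score, best.get(clip_id, float("-inf")))
    return sorted(best, key=best.get, reverse=True)[:k]


def search_reduction(avg_modalities: float, total: int = len(MODALITIES)) -> float:
    """Fraction of index searches avoided relative to querying every modality."""
    return 1.0 - avg_modalities / total


# Toy indices with fixed hits, for illustration only.
toy_indices: Dict[str, Index] = {
    "asr": lambda q: [("clip_a", 0.9), ("clip_b", 0.4)],
    "ocr": lambda q: [("clip_c", 0.8)],
    "visual": lambda q: [("clip_b", 0.7), ("clip_d", 0.2)],
}

if __name__ == "__main__":
    print(routed_search("person says hello near a sign", keyword_router, toy_indices))
    print(f"{search_reduction(1.78):.1%}")
```

In this sketch the example query triggers only the ASR and OCR indices, so the visual index is never touched. A production router would also need comparable scores across indices before merging, which the max-score merge here simply assumes.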

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques