Smart Routing for Multimodal Video Retrieval: When to Search What
Kevin Dela Rosa

TL;DR
ModaRoute is an LLM-based system that routes each video query to the modalities most likely to answer it, cutting computational overhead by 41% while reaching 60.9% Recall@5.
Contribution
We propose ModaRoute, an LLM-driven routing system that dynamically selects the most relevant modalities (ASR, OCR, visual) for each query in multimodal video retrieval, reducing computational overhead relative to searching every modality.
Findings
Achieves 60.9% Recall@5 while searching fewer modalities than exhaustive search
Reduces computational overhead by 41% through intelligent routing
Routes each query to an average of 1.78 modalities, versus 3.0 for exhaustive search
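The 41% figure is consistent with the reported modality counts if per-query cost is assumed to scale linearly with the number of modality indices searched. That linear cost model is an assumption made here for illustration, not something the summary states. A minimal sketch of the arithmetic:

```python
# Assumption: per-query cost is proportional to the number of modality
# indices searched. The two averages below come from the summary itself.
EXHAUSTIVE_MODALITIES = 3.0  # ASR + OCR + visual
ROUTED_MODALITIES = 1.78     # average modalities per query with routing


def relative_reduction(baseline: float, routed: float) -> float:
    """Fraction of baseline cost saved under the linear cost assumption."""
    return (baseline - routed) / baseline


reduction = relative_reduction(EXHAUSTIVE_MODALITIES, ROUTED_MODALITIES)
print(f"{reduction:.1%}")  # 40.7%, which rounds to the reported 41%
```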
Abstract
We introduce ModaRoute, an LLM-based intelligent routing system that dynamically selects optimal modalities for multimodal video retrieval. While dense text captions can achieve 75.9% Recall@5, they require expensive offline processing and miss critical visual information present in 34% of clips with scene text not captured by ASR. By analyzing query intent and predicting information needs, ModaRoute reduces computational overhead by 41% while achieving 60.9% Recall@5. Our approach uses GPT-4.1 to route queries across ASR (speech), OCR (on-screen text), and visual indices, averaging 1.78 modalities per query versus 3.0 for exhaustive search. Evaluation on 1.8M video clips demonstrates that intelligent routing provides a practical solution for scaling multimodal retrieval systems, reducing infrastructure costs while maintaining competitive effectiveness for real-world deployment.
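The route-then-search flow described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: `route_query` is a hypothetical keyword heuristic standing in for the GPT-4.1 router, the search callables are placeholders for the ASR, OCR, and visual indices, and the reciprocal rank fusion used to merge results is a choice made here, since the abstract does not say how per-modality results are combined.

```python
from typing import Callable, Dict, List

MODALITIES = ("asr", "ocr", "visual")

# A search function takes (query, k) and returns clip ids, best first.
SearchFn = Callable[[str, int], List[str]]


def route_query(query: str) -> List[str]:
    """Hypothetical stand-in for the LLM router (keyword rules, not GPT-4.1)."""
    q = query.lower()
    selected = []
    if any(w in q for w in ("says", "said", "talks about", "mentions")):
        selected.append("asr")
    if any(w in q for w in ("sign", "caption", "text", "logo", "title")):
        selected.append("ocr")
    if any(w in q for w in ("shows", "wearing", "scene", "looks")):
        selected.append("visual")
    # With no clear signal, fall back to exhaustive search over all indices.
    return selected or list(MODALITIES)


def retrieve(query: str, indices: Dict[str, SearchFn], k: int = 5) -> List[str]:
    """Search only the routed indices and merge with reciprocal rank fusion."""
    scores: Dict[str, float] = {}
    for modality in route_query(query):
        for rank, clip_id in enumerate(indices[modality](query, k), start=1):
            # RRF with the conventional constant 60 (an assumption here).
            scores[clip_id] = scores.get(clip_id, 0.0) + 1.0 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Under this sketch, a query such as "clip where the sign reads OPEN" touches only the OCR index, while a query with no recognizable cue falls back to all three. The savings come from the indices that are never queried.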
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
