Towards Resource-Efficient Multimodal Intelligence: Learned Routing among Specialized Expert Models
Mayank Saini, Arit Kumar Bishwas

TL;DR
This paper presents a modular, learned routing framework that efficiently directs queries to specialized models, reducing reliance on costly large models while maintaining high performance across multimodal tasks.
Contribution
Introduces a unified, learned routing system that dynamically allocates queries to specialized models, significantly reducing costs without sacrificing accuracy.
Findings
Achieves over 67% reduction in reliance on costly large models while matching or exceeding monolithic model performance.
Demonstrates effectiveness on benchmarks like MMLU and VQA.
Utilizes a two-stage vision pipeline optimized for efficiency.
Abstract
As AI moves beyond text, large language models (LLMs) increasingly power vision, audio, and document understanding; however, their high inference costs hinder real-time, scalable deployment. Conversely, smaller open-source models offer cost advantages but struggle with complex or multimodal queries. We introduce a unified, modular framework that intelligently routes each query (textual, multimodal, or complex) to the most fitting expert model, using a learned routing network that balances cost and quality. For vision tasks, we employ a two-stage open-source pipeline that is optimized for efficiency and revives efficient classical vision components where they remain state of the art (SOTA) for sub-tasks. On benchmarks such as Massive Multitask Language Understanding (MMLU) and Visual Question Answering (VQA), we match or exceed the performance of always-premium LLM (monolithic systems with one model serving all…
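To make the idea of routing that "balances cost and quality" concrete, here is a minimal sketch. It is not the authors' router: the expert names, the relative costs, the hard-coded quality scores, and the `cost_weight` parameter are all hypothetical. In a learned router, the predicted quality per expert would come from a trained network instead of a fixed dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    name: str    # hypothetical model label, not taken from the paper
    cost: float  # hypothetical relative cost per query


def route(predicted_quality: dict[str, float],
          experts: list[Expert],
          cost_weight: float) -> Expert:
    """Pick the expert with the best (predicted quality - cost_weight * cost)."""
    return max(
        experts,
        key=lambda e: predicted_quality[e.name] - cost_weight * e.cost,
    )


# Illustrative experts: one cheap open model, one expensive premium model.
EXPERTS = [Expert("small-open-model", 0.1), Expert("premium-llm", 1.0)]

# Stand-ins for a learned router's per-expert quality predictions.
easy_query = {"small-open-model": 0.90, "premium-llm": 0.95}
hard_query = {"small-open-model": 0.40, "premium-llm": 0.95}

easy_choice = route(easy_query, EXPERTS, cost_weight=0.2)
hard_choice = route(hard_query, EXPERTS, cost_weight=0.2)
print(easy_choice.name, hard_choice.name)
```

Under these assumed numbers, the easy query goes to the cheap model because the small quality gain from the premium model does not justify its cost, while the hard query is escalated. Setting `cost_weight` to 0 reduces the rule to always picking the highest predicted quality.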
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
