TL;DR
LatentRouter is a routing method for multimodal large language models (MLLMs) that predicts, for each image-question input, how well each candidate model would perform, and selects a model to match the input's multimodal requirements.
Contribution
It formulates MLLM routing as counterfactual multimodal utility prediction, using latent communication between query routing capsules and model capability tokens to estimate how each candidate model would perform if selected.
Findings
LatentRouter outperforms fixed-model and baseline routers on benchmark datasets.
Gains are most significant on tasks requiring visual, layout-sensitive, or reasoning skills.
Latent communication between model states is key to the improved performance.
Abstract
Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without…
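The routing flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "latent communication" is a single cross-attention step in which each model capability token attends over the query's routing capsules, and that the "distributional outcome head" is a softmax over a few quality bins whose mean is used as the predicted quality. All names here (`communicate`, `expected_quality`, `route`, `head_weights`, `bin_centers`, `cost_weight`) and all numeric values are hypothetical, and the cost penalty is an added assumption. The bounded capsule correction is left out because its description is cut off in the abstract.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def communicate(model_token, capsules):
    """Assumed latent communication: the model token attends over the
    query capsules (scaled dot-product attention) with a residual update."""
    scale = math.sqrt(len(model_token))
    weights = softmax([dot(model_token, c) / scale for c in capsules])
    context = [
        sum(w * c[i] for w, c in zip(weights, capsules))
        for i in range(len(model_token))
    ]
    return [t + c for t, c in zip(model_token, context)]


def expected_quality(state, head_weights, bin_centers):
    """Assumed distributional head: softmax over quality bins, then the mean."""
    probs = softmax([dot(state, w) for w in head_weights])
    return sum(p * b for p, b in zip(probs, bin_centers))


def route(capsules, model_tokens, head_weights, bin_centers,
          costs=None, cost_weight=0.0):
    """Score every candidate model for this query and return the best one."""
    scores = {}
    for name, token in model_tokens.items():
        state = communicate(token, capsules)
        quality = expected_quality(state, head_weights, bin_centers)
        penalty = cost_weight * costs[name] if costs else 0.0
        scores[name] = quality - penalty
    best = max(scores, key=scores.get)
    return best, scores


# Toy inputs (hypothetical values; in a trained router these are learned).
capsules = [[1.0, 0.0], [0.0, 1.0]]            # routing capsules for one query
model_tokens = {"model_a": [2.0, 0.0],          # one capability token per MLLM
                "model_b": [0.0, 2.0]}
head_weights = [[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # one row per quality bin
bin_centers = [0.0, 0.5, 1.0]

best, scores = route(capsules, model_tokens, head_weights, bin_centers)

# With a large cost weight, the cheaper model wins regardless of quality.
costs = {"model_a": 1.0, "model_b": 0.0}
cheap_best, _ = route(capsules, model_tokens, head_weights, bin_centers,
                      costs=costs, cost_weight=10.0)
```

The sketch scores every candidate for the same query, not only the model that ends up being called, which is what makes the prediction counterfactual. Predicting a distribution over quality bins instead of a single number is one way to keep uncertainty available for close decisions.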
