Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs
Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Pawan Goyal, Manish Gupta

TL;DR
This paper introduces Multimodal Auto-Completion (MAC), a task that leverages visual context for real-time character prediction in dialogs, and proposes Router-Suggest, a dynamic model selector that improves efficiency and user satisfaction.
Contribution
The paper develops MAC as a new multimodal auto-completion task, adapts datasets, evaluates vision-language models, and introduces Router-Suggest for dynamic model selection, enhancing efficiency and user experience.
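To make the task setup concrete, here is a minimal sketch of how a single MAC example could be formed from one dialog turn: the input is the dialog history, the shared image, and a partially typed prefix, and the target is the remaining characters. The `MACExample` type, the `make_examples` helper, and the fixed `step` between prefix cuts are illustrative assumptions, not the paper's actual procedure for adapting MMDialog or ImageChat.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MACExample:
    image_id: Optional[str]  # hypothetical reference to the shared image
    history: List[str]       # earlier dialog turns
    prefix: str              # what the user has typed so far
    target: str              # characters still to be predicted


def make_examples(image_id: Optional[str], history: List[str],
                  utterance: str, step: int = 4) -> List[MACExample]:
    """Cut one utterance into (prefix, target) pairs every `step` characters.

    Assumption: prefixes are taken at fixed character offsets; the source
    does not state how prefixes are sampled.
    """
    examples = []
    for cut in range(step, len(utterance), step):
        examples.append(
            MACExample(image_id, list(history), utterance[:cut], utterance[cut:])
        )
    return examples
```

Each example keeps `prefix + target` equal to the original utterance, so a text-only baseline can simply ignore `image_id` while a VLM consumes it.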
Findings
VLMs outperform textual models in user satisfaction and completion quality.
Router-Suggest achieves 2.3x to 10x speedup over the best VLM.
Multimodal context significantly improves auto-completion performance.
Abstract
Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a…
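The routing idea described in the abstract, choosing per request between a textual model and a VLM, can be sketched as follows. This is a toy illustration under stated assumptions: the keyword-based `toy_visual_score`, the `threshold` value, and the stub models are all hypothetical stand-ins, and the source does not describe how Router-Suggest actually scores dialog context.

```python
from typing import Callable, Optional, Tuple

Completer = Callable[[str, Optional[str]], str]
Scorer = Callable[[str, Optional[str]], float]

# Assumption: words that hint the user is referring to the image.
VISUAL_CUES = {"this", "that", "these", "those", "color", "photo", "picture"}


def toy_visual_score(prefix: str, image_id: Optional[str]) -> float:
    """Stand-in scorer: fraction of prefix words that hint at the image."""
    if image_id is None:
        return 0.0
    words = prefix.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!?") in VISUAL_CUES)
    return hits / len(words)


def route(prefix: str, image_id: Optional[str], scorer: Scorer,
          text_model: Completer, vlm: Completer,
          threshold: float = 0.25) -> Tuple[str, str]:
    """Return (chosen model name, completion).

    The cheaper text model is the default; the VLM is called only when
    the scorer suggests the visual context matters.
    """
    if scorer(prefix, image_id) >= threshold:
        return "vlm", vlm(prefix, image_id)
    return "text", text_model(prefix, image_id)


# Stub models so the sketch runs without loading any real model.
def stub_text_model(prefix: str, image_id: Optional[str]) -> str:
    return prefix + " [text completion]"


def stub_vlm(prefix: str, image_id: Optional[str]) -> str:
    return prefix + " [image-grounded completion]"
```

For example, `route("is that a", "img_1", toy_visual_score, stub_text_model, stub_vlm)` selects the VLM stub, while a prefix such as `"see you tomorrow"` falls through to the text stub. The efficiency gain in a design like this comes from skipping the more expensive model whenever the cheap path is judged sufficient.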
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
