Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs

Sandeep Mishra; Devichand Budagam; Anubhab Mandal; Bishal Santra; Pawan Goyal; Manish Gupta

arXiv:2601.05851 · cs.CL · January 12, 2026

TL;DR

This paper introduces Multimodal Auto-Completion (MAC), a task that leverages visual context for real-time character prediction in dialogs, and proposes Router-Suggest, a dynamic model selector that improves efficiency and user satisfaction.

Contribution

The paper develops MAC as a new multimodal auto-completion task, adapts datasets, evaluates vision-language models, and introduces Router-Suggest for dynamic model selection, enhancing efficiency and user experience.

Findings

01 VLMs outperform textual models in user satisfaction and completion quality.

02 Router-Suggest achieves 2.3x to 10x speedup over the best VLM.

03 Multimodal context significantly improves auto-completion performance.

Abstract

Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a…
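The routing idea described in the abstract (choosing between a textual model and a VLM based on dialog context) can be illustrated with a minimal sketch. This is not the paper's router: the keyword heuristic in `needs_visual_model`, the `DialogContext` fields, and both toy models are hypothetical stand-ins chosen only to show the control flow. A real router would presumably be learned from data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical cue words suggesting the user is referring to the shared image.
VISUAL_CUES = {"this", "that", "these", "those", "look", "color", "picture", "photo"}


@dataclass
class DialogContext:
    prefix: str                                        # partially typed text
    history: List[str] = field(default_factory=list)   # earlier dialog turns
    has_image: bool = False                            # is visual context present?


# A completer maps a dialog context to a suggested continuation.
Completer = Callable[[DialogContext], str]


def needs_visual_model(ctx: DialogContext) -> bool:
    """Assumed heuristic: use the VLM only when an image exists and the
    typed prefix contains a word that likely refers to it."""
    if not ctx.has_image:
        return False
    words = set(ctx.prefix.lower().split())
    return bool(words & VISUAL_CUES)


def route(ctx: DialogContext, text_model: Completer, vlm: Completer) -> Tuple[str, str]:
    """Return (name of chosen model, its completion)."""
    if needs_visual_model(ctx):
        return "vlm", vlm(ctx)
    return "text", text_model(ctx)


# Toy stand-ins for a cheap textual model and an expensive VLM.
def toy_text_model(ctx: DialogContext) -> str:
    return " you doing today?"


def toy_vlm(ctx: DialogContext) -> str:
    return " dog in the photo?"


if __name__ == "__main__":
    grounded = DialogContext(prefix="what breed is that", has_image=True)
    generic = DialogContext(prefix="how are", has_image=True)
    print(route(grounded, toy_text_model, toy_vlm))
    print(route(generic, toy_text_model, toy_vlm))
```

The design point the sketch tries to capture is that the expensive model is invoked only for prefixes that appear to depend on the image, which is one plausible source of the speedup the findings report.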

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems