TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning
Aritra Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma, Yuki M. Asano, Martin R. Oswald, Cees G. M. Snoek

TL;DR
This paper introduces TWIST & SCOUT, a framework that enhances multimodal large language models with visual grounding abilities without losing their existing skills, using twin-expert tuning and synthetic reasoning datasets.
Contribution
The paper presents a novel twin-expert tuning method and a synthetic dataset, SCOUT, enabling multimodal models to learn visual grounding without forgetting prior knowledge.
Findings
Strong performance on multiple visual grounding benchmarks
Retains pre-trained image understanding skills
Effective stepwise fine-tuning with synthetic data
Abstract
Spatial awareness is key to enable embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Multimodal Large Language Models (MLLMs) struggle at this task. In this paper, we introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability without forgetting their existing image and language understanding skills. To this end, we propose TWIST, a twin-expert stepwise tuning module that modifies the decoder of the language model using one frozen module pre-trained on image understanding tasks and another learnable one for visual grounding tasks. This allows the MLLM to retain previously learned knowledge and skills, while acquiring what is missing. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT, which mimics human reasoning in visual grounding. This dataset provides rich…
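The twin-expert idea described in the abstract (one frozen module that keeps the pre-trained skills, one learnable module that picks up grounding) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `LinearExpert` and `TwinExpertLayer`, the use of tiny linear maps as experts, the summation of the two expert outputs, and the squared-error loss are all assumptions chosen to keep the example self-contained in pure Python.

```python
import copy
import random


class LinearExpert:
    """Hypothetical stand-in for an expert module: a linear map y = W x."""

    def __init__(self, dim, trainable, seed):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(dim)]
        self.trainable = trainable

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]

    def sgd_step(self, x, grad_out, lr):
        # Frozen experts ignore updates, so their knowledge is untouched.
        if not self.trainable:
            return
        for i, row in enumerate(self.weights):
            for j in range(len(row)):
                row[j] -= lr * grad_out[i] * x[j]


class TwinExpertLayer:
    """Two experts side by side: one frozen, one learnable (assumed design)."""

    def __init__(self, dim):
        self.frozen = LinearExpert(dim, trainable=False, seed=0)
        self.grounding = LinearExpert(dim, trainable=True, seed=1)

    def forward(self, x):
        # Assumption: outputs are simply summed; the paper may combine them differently.
        frozen_out = self.frozen.forward(x)
        grounding_out = self.grounding.forward(x)
        return [a + b for a, b in zip(frozen_out, grounding_out)]

    def train_step(self, x, target, lr=0.1):
        """One SGD step on squared error; returns the loss before the update."""
        out = self.forward(x)
        grad = [2.0 * (o - t) for o, t in zip(out, target)]
        self.frozen.sgd_step(x, grad, lr)      # no-op: expert is frozen
        self.grounding.sgd_step(x, grad, lr)   # only this expert changes
        return sum((o - t) ** 2 for o, t in zip(out, target))


layer = TwinExpertLayer(dim=3)
frozen_before = copy.deepcopy(layer.frozen.weights)

x = [1.0, 0.5, -0.5]
target = [0.2, -0.1, 0.3]
first_loss = layer.train_step(x, target)
second_loss = layer.train_step(x, target)

print("frozen expert unchanged:", layer.frozen.weights == frozen_before)
print("loss decreased:", second_loss < first_loss)
```

In this toy setup, only the learnable expert's weights move during training, which is the mechanism the abstract credits for retaining previously learned skills while acquiring the missing one.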
Taxonomy
Topics: Optical Wireless Communication Technologies · Optical Network Technologies
Methods: Mixture of Experts
