TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning

Aritra Bhowmik; Mohammad Mahdi Derakhshani; Dennis Koelma; Yuki M. Asano; Martin R. Oswald; Cees G. M. Snoek

arXiv:2410.10491·cs.CV·March 21, 2025

Open Access

TL;DR

This paper introduces TWIST & SCOUT, a framework that enhances multimodal large language models with visual grounding abilities without losing their existing skills, using twin-expert tuning and synthetic reasoning datasets.

Contribution

The paper presents a novel twin-expert tuning method and a synthetic dataset, SCOUT, enabling multimodal models to learn visual grounding without forgetting prior knowledge.

Findings

01  Strong performance on multiple visual grounding benchmarks

02  Retains pre-trained image understanding skills

03  Effective stepwise fine-tuning with synthetic data

Abstract

Spatial awareness is key to enabling embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Multimodal Large Language Models (MLLMs) struggle with this task. In this paper, we introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability without forgetting their existing image and language understanding skills. To this end, we propose TWIST, a twin-expert stepwise tuning module that modifies the decoder of the language model using one frozen module pre-trained on image understanding tasks and another learnable one for visual grounding tasks. This allows the MLLM to retain previously learned knowledge and skills while acquiring what is missing. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT, which mimics human reasoning in visual grounding. This dataset provides rich…
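
The twin-expert idea in the abstract (one frozen module, one learnable module) can be illustrated with a toy sketch. This is not the authors' implementation: the class name `TwinExpertLayer`, the additive combination of the two experts, the zero initialization of the learnable expert, and the squared-error objective are all assumptions chosen to keep the example small and dependency-free. It only shows the forget-free property the abstract describes: training touches the learnable expert while the pre-trained weights stay bit-for-bit unchanged.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class TwinExpertLayer:
    """Toy layer with a frozen expert and a learnable expert (hypothetical)."""

    def __init__(self, pretrained_W):
        # Frozen expert: a copy of the pre-trained weights, never updated.
        self.frozen_W = [row[:] for row in pretrained_W]
        # Learnable expert: assumed to start at zero, so the layer initially
        # reproduces the pre-trained behaviour exactly.
        self.learn_W = [[0.0] * len(row) for row in pretrained_W]

    def forward(self, x):
        # Assumption: the two expert outputs are simply summed.
        frozen_out = matvec(self.frozen_W, x)
        learn_out = matvec(self.learn_W, x)
        return [a + b for a, b in zip(frozen_out, learn_out)]

    def sgd_step(self, x, target, lr=0.1):
        """One gradient step on 0.5 * ||forward(x) - target||^2.

        Only learn_W is updated; returns the loss before the update.
        """
        err = [y - t for y, t in zip(self.forward(x), target)]
        for i, e in enumerate(err):
            for j, xj in enumerate(x):
                self.learn_W[i][j] -= lr * e * xj
        return 0.5 * sum(e * e for e in err)


pretrained = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for pre-trained weights
layer = TwinExpertLayer(pretrained)

x = [1.0, 2.0]                           # stand-in input features
target = [2.0, 1.0]                      # stand-in "grounding" target

initial_output = layer.forward(x)
losses = [layer.sgd_step(x, target) for _ in range(50)]
final_loss = 0.5 * sum((y - t) ** 2 for y, t in zip(layer.forward(x), target))
```

After the loop, `layer.frozen_W` still equals `pretrained`, while `final_loss` is far below `losses[0]`; in this toy setting the new behaviour is carried entirely by the learnable expert.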

Taxonomy

Topics: Optical Wireless Communication Technologies · Optical Network Technologies

Methods: Mixture of Experts