F-LMM: Grounding Frozen Large Multimodal Models

Size Wu; Sheng Jin; Wenwei Zhang; Lumin Xu; Wentao Liu; Wei Li; Chen Change Loy

arXiv:2406.05821 · cs.CV · April 14, 2025

TL;DR

F-LMM introduces a method to add visual grounding to frozen large multimodal models without fine-tuning, preserving their conversational abilities while enabling effective grounding and reasoning tasks.

Contribution

The paper proposes a simple approach that grounds frozen LMMs by using attention mechanisms with minimal additional training, which avoids catastrophic forgetting of conversational skills.
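As a rough illustration of how attention from a frozen model could serve as a grounding signal, the sketch below computes the attention of one word token over image patches and reshapes it into a spatial map. It is a toy under stated assumptions, not the paper's implementation: the function names, the hand-picked feature vectors, and the fixed threshold are all hypothetical, and the threshold merely stands in for whatever small trained component turns attention into a mask.

```python
import math


def word_to_patch_attention(query, patch_keys):
    """Softmax-normalised scaled dot-product attention of one word query over patch keys."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in patch_keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def to_grid(weights, height, width):
    """Reshape a flat list of per-patch weights into a height x width grid."""
    if len(weights) != height * width:
        raise ValueError("number of weights must equal height * width")
    return [weights[r * width:(r + 1) * width] for r in range(height)]


def coarse_mask(grid, threshold):
    """Binarise the attention grid (hypothetical stand-in for a trained mask head)."""
    return [[1 if w >= threshold else 0 for w in row] for row in grid]


# Toy inputs (assumed values): a 2x2 patch grid with 2-dimensional features.
# In a real setting the query and keys would be read from the frozen model.
patch_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0], [4.0, 0.0]]
query = [4.0, 0.0]

attn = word_to_patch_attention(query, patch_keys)
grid = to_grid(attn, 2, 2)
mask = coarse_mask(grid, threshold=1.0 / len(patch_keys))
```

In this toy case the word query matches the top-left and bottom-right patches, so those two cells receive nearly all of the attention mass. Nothing in the model that produced the query and keys is updated, which is the property that keeps conversational behaviour intact.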

Findings

01. Achieves competitive grounding performance without fine-tuning LMMs.

02. Preserves original conversational and instruction-following abilities.

03. Enables complex reasoning and grounded conversation tasks.

Abstract

Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AIs' understanding of the visual world and their interaction with humans. However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit grounding and segmentation datasets. Such a design would inevitably cause a catastrophic diminution in the indispensable conversational capability of general AI assistants. In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. To address this issue, we present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations -- a straightforward yet effective design based on the…

Code & Models

Models

linhuixiao/Awesome-Visual-Grounding

Taxonomy

Topics: Computational Physics and Python Applications · Advanced Computational Techniques and Applications