F-LMM: Grounding Frozen Large Multimodal Models
Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy

TL;DR
F-LMM adds visual grounding to frozen, off-the-shelf large multimodal models without fine-tuning them, preserving their conversational abilities while enabling grounding and grounded reasoning tasks.
Contribution
The paper proposes a simple approach that grounds frozen LMMs using their attention mechanisms and minimal additional training. Because the LMM's own parameters are left unchanged, it avoids catastrophic forgetting of conversational skills.
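As a rough illustration of how attention can carry grounding information, the sketch below turns the attention weights from one word token to the image patches into a coarse binary mask. It is a toy under stated assumptions, not the paper's implementation: the function name `attention_to_mask`, the per-head list layout, and the mean-threshold rule are all hypothetical, and the fixed threshold only stands in for whatever small trained component the method uses, which this summary does not describe.

```python
from statistics import mean


def attention_to_mask(attn, grid_h, grid_w):
    """Convert word-to-patch attention into a coarse binary mask.

    attn: list of per-head lists; each inner list holds grid_h * grid_w
          attention weights from one word token to the image patches.
          (Hypothetical layout, chosen for illustration only.)
    Returns a grid_h x grid_w nested list of 0/1 values.
    """
    num_patches = grid_h * grid_w
    if any(len(head) != num_patches for head in attn):
        raise ValueError("each head must have grid_h * grid_w weights")

    # Average the attention each patch receives across heads.
    avg = [mean(head[i] for head in attn) for i in range(num_patches)]

    # Assumption: threshold at the mean attention value. A fixed rule like
    # this is a stand-in for a learned mask head, not the paper's method.
    threshold = mean(avg)
    flat = [1 if a > threshold else 0 for a in avg]

    # Reshape the flat patch list back into image-grid rows.
    return [flat[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy input: 2 attention heads over a 2x2 patch grid.
toy_attn = [
    [0.1, 0.6, 0.1, 0.2],
    [0.0, 0.8, 0.1, 0.1],
]
toy_mask = attention_to_mask(toy_attn, 2, 2)
print(toy_mask)
```

In this toy input both heads concentrate on the second patch, so only that patch is marked in the mask. Nothing in the sketch touches model weights, which is the property the frozen-LMM setting relies on.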
Findings
Achieves competitive grounding performance without fine-tuning LMMs.
Preserves original conversational and instruction-following abilities.
Enables complex reasoning and grounded conversation tasks.
Abstract
Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AIs' understanding of the visual world and their interaction with humans. However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit grounding and segmentation datasets. Such a design would inevitably cause a catastrophic diminution in the indispensable conversational capability of general AI assistants. In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. To address this issue, we present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations -- a straightforward yet effective design based on the…
Taxonomy
Topics: Computational Physics and Python Applications · Advanced Computational Techniques and Applications
