From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Grounded Open-vocabulary Situation Recognition
Chen Cai, Tianyi Liu, Jianjun Gao, Wenyang Liu, Kejun Wu, Ruoyu Wang, Yi Wang, Soo Chin Liew

TL;DR
This paper introduces a novel knowledge distillation framework from multimodal large language models to small grounded situation recognition models, significantly improving their ability to recognize unseen and rare situations in an open-vocabulary setting.
Contribution
It proposes Multimodal Interactive Prompt Distillation (MIPD), a new framework that enhances generalization and zero-shot recognition in GSR models by distilling multimodal knowledge from foundation models.
Findings
Achieves superior performance on seen, rare, and unseen situations in the Ov-SWiG dataset.
Improves unseen detection capabilities on the HICO-DET dataset.
Enhances the recognition of rare and unseen situations in GSR models.
Abstract
Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we exploit transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling
