From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Grounded Open-vocabulary Situation Recognition

Chen Cai; Tianyi Liu; Jianjun Gao; Wenyang Liu; Kejun Wu; Ruoyu Wang; Yi Wang; Soo Chin Liew

arXiv:2507.14686·cs.CV·November 12, 2025

Open Access

TL;DR

This paper introduces a knowledge distillation framework that transfers knowledge from a multimodal large language model (MLLM) to a small grounded situation recognition (GSR) model, improving the small model's ability to recognize unseen and rare situations in an open-vocabulary setting.

Contribution

It proposes Multimodal Interactive Prompt Distillation (MIPD), a new framework that enhances generalization and zero-shot recognition in GSR models by distilling multimodal knowledge from foundation models.

Findings

1. Achieves superior performance on seen, rare, and unseen situations in the Ov-SWiG dataset.

2. Improves unseen detection capabilities on the HICO-DET dataset.

3. Enhances the recognition of rare and unseen situations in GSR models.

Abstract

Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are too resource-intensive for edge device deployment. Meanwhile, conventional GSR models often generalize poorly and fall short in recognizing unseen and rare situations. In this paper, we explore transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and to better handle rare situations. Specifically, the MIPD framework first leverages the LLM-based…
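The general idea of distilling a teacher into a smaller open-vocabulary student can be illustrated with a minimal, generic sketch. This is not the MIPD objective, which the truncated abstract does not specify; it assumes a common setup in which an image embedding is scored against text embeddings of candidate labels by cosine similarity, and the student is trained to match the teacher's temperature-softened score distribution via a KL divergence. All function names, the toy two-dimensional embeddings, and the temperature value are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def open_vocab_scores(image_emb, label_embs):
    """Score one image embedding against each label's text embedding.

    Because labels enter only through their embeddings, new (unseen)
    labels can be scored without changing the model's output layer.
    """
    return [cosine(image_emb, text_emb) for text_emb in label_embs]


def distillation_loss(student_scores, teacher_scores, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy, made-up embeddings for three candidate labels (illustration only).
label_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
teacher_emb = [0.9, 0.1]   # hypothetical teacher image embedding
student_emb = [0.5, 0.5]   # hypothetical student image embedding

teacher_scores = open_vocab_scores(teacher_emb, label_embs)
student_scores = open_vocab_scores(student_emb, label_embs)
loss = distillation_loss(student_scores, teacher_scores)
print(f"distillation loss: {loss:.4f}")
```

In this sketch the loss is zero when the student reproduces the teacher's scores exactly and grows as the two distributions diverge; a real training loop would minimize it with respect to the student's parameters, typically alongside a supervised loss.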

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling