Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs
Zhaolin Cai, Huiyu Duan, Zitong Xu, Fan Li, Zhi Liu, Jing Liu, Wei Shen, Xiongkuo Min, Guangtao Zhai

TL;DR
This paper introduces GRASP-HO, a framework that recasts HOI detection as an open-vocabulary generative task. A cognitive steering module integrates visual evidence with a large language model, which improves generalization.
Contribution
It presents a new generative reasoning framework for HOI detection that bridges vision and language models, improving open-world and zero-shot performance.
Findings
Achieves state-of-the-art closed-set HOI detection performance.
Demonstrates strong zero-shot generalization to unseen interactions.
Provides a unified perception and reasoning paradigm for HOI detection.
Abstract
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set, an approach that struggles to generalize to the long tail of unseen or ambiguous interactions in the wild. While recent multi-modal large language models (MLLMs) possess the rich world knowledge required for open-vocabulary understanding, they remain decoupled from existing HOI detectors because fine-tuning them is computationally prohibitive. To address these constraints, we propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem. To bridge vision and cognition, we first extract hybrid interaction…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications
