CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models
Jianjun Gao, Chen Cai, Ruoyu Wang, Wenyang Liu, Kim-Hui Yap, Kratika Garg, Boon-Siew Han

TL;DR
This paper introduces CL-HOI, a framework that distills instance-level human-object interaction (HOI) detection from the image-level understanding of vision large language models (VLLMs), reducing reliance on manual annotations and improving detection accuracy over existing weakly supervised methods.
Contribution
The proposed CL-HOI framework uniquely combines context and interaction distillation from VLLMs to enable instance-level HOI detection without manual labels.
Findings
Outperforms existing weakly supervised methods on HICO-DET and V-COCO datasets.
Effectively transfers image-level knowledge to instance-level HOI detection.
Reduces dependence on manual annotations for HOI detection.
Abstract
Human-object interaction (HOI) detection has seen advancements with Vision Language Models (VLMs), but these methods often depend on extensive manual annotations. Vision Large Language Models (VLLMs) can inherently recognize and reason about interactions at the image level but are computationally heavy and not designed for instance-level HOI detection. To overcome these limitations, we propose a Cross-Level HOI distillation (CL-HOI) framework, which distills instance-level HOIs from VLLMs' image-level understanding without the need for manual annotations. Our approach involves two stages: context distillation, where a Visual Linguistic Translator (VLT) converts visual information into linguistic form, and interaction distillation, where an Interaction Cognition Network (ICN) reasons about spatial, visual, and context relations. We design contrastive distillation losses to transfer…
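The abstract is cut off before it specifies the contrastive distillation losses, so the exact formulation is not available here. As a rough illustration of what a contrastive distillation objective can look like, the sketch below uses a generic InfoNCE-style loss, which is an assumption and not necessarily the loss used in CL-HOI. The function name `contrastive_distillation_loss`, the `temperature` value, and the toy feature vectors are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returns it unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_distillation_loss(student, teacher, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed stand-in, not the paper's loss).

    Row i of `student` is treated as the positive match for row i of
    `teacher`; every other teacher row acts as a negative.
    """
    s = [normalize(v) for v in student]
    t = [normalize(v) for v in teacher]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of student i against every teacher embedding.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matching teacher row.
        total += log_denom - logits[i]
    return total / len(s)


# Toy data: hypothetical teacher (VLLM-derived) and student embeddings.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled = aligned[1:] + aligned[:1]  # deliberately mismatched pairing

loss_aligned = contrastive_distillation_loss(aligned, teacher)
loss_shuffled = contrastive_distillation_loss(shuffled, teacher)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

In this toy setup the loss is lower when each student feature is paired with its matching teacher embedding than when the pairing is shuffled, which is the behavior a distillation objective of this kind relies on.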
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
