VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
Donggoo Kang, Dasol Jeong, Hyunmin Lee, Sangwoo Park, Hasil Park, Sunkyu Kwon, Yeongjoon Kim, Joonki Paik

TL;DR
This paper introduces VLM-HOI, a novel method leveraging vision-language models' understanding to improve human-object interaction detection, achieving state-of-the-art results and enhancing interpretability.
Contribution
To the authors' knowledge, it is the first to use a VLM's language understanding as a training objective for HOI detection, with an image-text matching score driving contrastive optimization.
Findings
Achieves state-of-the-art HOI detection accuracy
Demonstrates effectiveness of language-based matching score
Enhances interpretability of human-object interaction analysis
Abstract
The Large Vision Language Model (VLM) has recently achieved remarkable progress in bridging two fundamental modalities. A VLM trained on a sufficiently large dataset exhibits a comprehensive understanding of both visual and linguistic information, enabling it to perform diverse tasks. To distill this knowledge accurately, in this paper, we introduce a novel approach (VLM-HOI) that explicitly uses a VLM as an objective function for the Human-Object Interaction (HOI) detection task. Specifically, we propose a method that quantifies the similarity of the predicted HOI triplet using image-text matching. We represent HOI triplets linguistically to fully utilize the language comprehension of VLMs, which are more suitable than CLIP models due to their localization and object-centric nature. This matching score is used as an objective for contrastive optimization. To our knowledge, this…
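The idea of turning a triplet into text and using a matching score as a contrastive objective can be illustrated with a minimal sketch. It is not the authors' implementation: the sentence template, the function names `triplet_to_text` and `contrastive_loss`, the InfoNCE-style form of the loss, and the `temperature` value are all assumptions made for illustration, and the matching scores are plain numbers standing in for whatever a VLM's image-text matching head would return.

```python
import math


def triplet_to_text(subject, verb, obj):
    """Render an HOI triplet as a short sentence (hypothetical template)."""
    article = "an" if obj[0].lower() in "aeiou" else "a"
    return f"a {subject} {verb} {article} {obj}"


def contrastive_loss(scores, positive_index, temperature=0.1):
    """InfoNCE-style loss over image-text matching scores (assumed form).

    scores[i] is the matching score between one image and the i-th
    candidate triplet sentence; positive_index marks the correct sentence.
    The loss is small when the correct sentence scores highest.
    """
    logits = [s / temperature for s in scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[positive_index]


if __name__ == "__main__":
    candidates = [
        triplet_to_text("person", "riding", "bicycle"),
        triplet_to_text("person", "holding", "bicycle"),
        triplet_to_text("person", "eating", "apple"),
    ]
    # Placeholder scores; a real system would obtain them from a VLM.
    scores = [0.9, 0.2, 0.1]
    for text, score in zip(candidates, scores):
        print(f"{score:.1f}  {text}")
    print("loss:", contrastive_loss(scores, positive_index=0))
```

In this sketch, raising the score of the correct sentence relative to the others lowers the loss, which is the behavior a contrastive objective built on a matching score would need.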
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
