VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

Donggoo Kang; Dasol Jeong; Hyunmin Lee; Sangwoo Park; Hasil Park; Sunkyu Kwon; Yeongjoon Kim; Joonki Paik

arXiv:2411.18038·cs.CV·November 28, 2024

Open Access

TL;DR

This paper introduces VLM-HOI, a novel method leveraging vision-language models' understanding to improve human-object interaction detection, achieving state-of-the-art results and enhancing interpretability.

Contribution

The paper claims to be the first to use VLMs' language understanding as an objective for HOI detection, with an image-text matching score driving contrastive optimization.

Findings

01 Achieves state-of-the-art HOI detection accuracy

02 Demonstrates effectiveness of language-based matching score

03 Enhances interpretability of human-object interaction analysis

Abstract

Large Vision Language Models (VLMs) have recently achieved remarkable progress in bridging two fundamental modalities. A VLM trained on a sufficiently large dataset exhibits a comprehensive understanding of both visual and linguistic information and can perform diverse tasks. To distill this knowledge accurately, in this paper, we introduce a novel approach that explicitly uses a VLM as an objective function for the Human-Object Interaction (HOI) detection task (VLM-HOI). Specifically, we propose a method that quantifies the similarity of the predicted HOI triplet using the Image-Text matching technique. We represent HOI triplets linguistically to fully utilize the language comprehension of VLMs, which are more suitable than CLIP models due to their localization and object-centric nature. This matching score is used as an objective for contrastive optimization. To our knowledge, this…
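
The idea of using a matching score as a contrastive objective can be illustrated with a minimal sketch. This is not the authors' implementation: the sentence template, the function names `triplet_to_text` and `contrastive_loss`, the temperature value, and the InfoNCE-style softmax form are all assumptions made for illustration, and the matching scores are supplied as plain numbers rather than computed by a real VLM.

```python
import math


def triplet_to_text(human, verb, obj):
    """Render an HOI triplet as a sentence (hypothetical template)."""
    return f"a {human} is {verb} a {obj}"


def contrastive_loss(scores, positive_index, temperature=0.1):
    """InfoNCE-style loss over image-text matching scores (assumed form).

    scores: one matching score per candidate sentence for the same image;
            in practice these would come from a VLM's image-text matching head.
    positive_index: position of the sentence for the ground-truth triplet.
    """
    logits = [s / temperature for s in scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of the positive sentence under the softmax.
    return log_sum - logits[positive_index]


candidates = [
    triplet_to_text("person", "riding", "bicycle"),    # ground truth
    triplet_to_text("person", "repairing", "bicycle"),
    triplet_to_text("person", "carrying", "bicycle"),
]
# Placeholder scores standing in for VLM image-text matching outputs.
loss = contrastive_loss([0.9, 0.2, 0.1], positive_index=0)
```

The loss shrinks as the score of the ground-truth sentence rises relative to the alternatives, which is the sense in which a matching score can serve as a training objective.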

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training