An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set

Chaoyi Ai

arXiv:2408.05772·cs.CV·August 13, 2024

Open Access

TL;DR

This paper explores a training-free approach to Human-Object Interaction detection using multimodal visual foundation models when only test data is available, revealing limitations in open vocabulary capabilities and the impact of grounding methods.

Contribution

It introduces a novel testing scenario for HOI detection without training data, evaluating multimodal models' open vocabulary abilities in this setting.

Findings

01

Open vocabulary capabilities are not fully realized in current models.

02

Replacing the feature extraction with Grounding DINO confirms these limitations.

03

The approach provides insights into model performance without training.

Abstract

Human-Object Interaction (HOI) detection aims to identify the pairs of humans and objects in images and to recognize their relationships, ultimately forming $\langle \text{human}, \text{object}, \text{verb} \rangle$ triplets. Under default settings, HOI performance is nearly saturated, and many studies now focus on long-tail distributions and zero-shot/few-shot scenarios. Let us consider an intriguing problem: "What if only a test dataset is available, with no training dataset, and a multimodal visual foundation model is used in a training-free manner?" This study uses two experimental settings: ground truth and random arbitrary combinations. We draw some interesting conclusions and find that the open vocabulary capabilities of the multimodal visual foundation model are not yet fully realized. Additionally, replacing the feature extraction with Grounding DINO further confirms these findings.
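The abstract does not spell out the scoring pipeline, so the following is only a minimal sketch of how a training-free setup of this kind could rank candidate triplets: enumerate verb-object combinations as text prompts and compare each prompt embedding against the embedding of a human-object region. Everything here is an assumption made for illustration, not the paper's method: the function names `build_prompts`, `cosine`, and `rank_triplets` are hypothetical, the prompt template and temperature are arbitrary, and the embeddings are hand-made toy vectors standing in for the output of an image-text foundation model.

```python
import math
from itertools import product


def build_prompts(verbs, objects):
    """Enumerate every verb-object combination as a text prompt (hypothetical template)."""
    return {(v, o): f"a photo of a person {v} a {o}" for v, o in product(verbs, objects)}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_triplets(region_embedding, prompt_embeddings, temperature=0.07):
    """Softmax over cosine similarities; returns (triplet, probability) pairs, best first."""
    keys = list(prompt_embeddings)
    logits = [cosine(region_embedding, prompt_embeddings[k]) / temperature for k in keys]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    scored = [(("human",) + k, e / total) for k, e in zip(keys, exps)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: in a real setting these vectors would come from a foundation model.
prompts = build_prompts(["riding", "holding"], ["bicycle", "cup"])
toy_prompt_embeddings = {
    ("riding", "bicycle"): [1.0, 0.0, 1.0, 0.0],
    ("riding", "cup"): [1.0, 0.0, 0.0, 1.0],
    ("holding", "bicycle"): [0.0, 1.0, 1.0, 0.0],
    ("holding", "cup"): [0.0, 1.0, 0.0, 1.0],
}
toy_region_embedding = [0.9, 0.1, 0.8, 0.1]

ranked = rank_triplets(toy_region_embedding, toy_prompt_embeddings)
for triplet, probability in ranked:
    print(triplet, round(probability, 4))
```

No model is trained or fine-tuned at any point; the only inputs are test-time embeddings and a list of candidate verbs and objects, which is the sense in which such a setup is training-free.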

Taxonomy

Topics: Speech and dialogue systems · Geographic Information Systems Studies · Educational Tools and Methods

Methods: Softmax · Dense Connections · Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Attention Is All You Need · Vision Transformer · self-DIstillation with NO labels