An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set
Chaoyi Ai

TL;DR
This paper explores a training-free approach to Human-Object Interaction detection using multimodal visual foundation models when only test data is available, revealing limitations in open vocabulary capabilities and the impact of grounding methods.
Contribution
It introduces a novel testing scenario for HOI detection without training data, evaluating multimodal models' open vocabulary abilities in this setting.
Findings
Open vocabulary capabilities are not fully realized in current models.
Replacing feature extraction with Grounding DINO confirms these limitations.
The approach provides insights into model performance without training.
Abstract
Human-Object Interaction (HOI) detection aims to identify human-object pairs in images and to recognize their relationships, ultimately forming triplets. Under default settings, HOI performance is nearly saturated, and many studies now focus on long-tail distributions and zero-shot/few-shot scenarios. Let us consider an intriguing problem: "What if there is only a test dataset and no training dataset, and a multimodal visual foundation model is used in a training-free manner?" This study uses two experimental settings: ground truth and random arbitrary combinations. We reach some interesting conclusions and find that the open vocabulary capabilities of the multimodal visual foundation model are not yet fully realized. Additionally, replacing the feature extraction with Grounding DINO further confirms these findings.
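The abstract does not spell out the evaluation pipeline, so the following is only an illustrative sketch of how the two settings might differ. It assumes that the ground-truth setting scores annotated human-object pairs, that the arbitrary-combination setting scores every possible pairing, and that a verb is chosen by comparing a pair embedding with text-prompt embeddings via cosine similarity. All function names, the embedding vectors, and the verb prompts are hypothetical and not taken from the paper.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_pairs(humans, objects, gt_pairs=None):
    """Return (human_index, object_index) pairs to score.

    Assumption: with gt_pairs, only annotated pairs are used; without it,
    every human is paired with every object.
    """
    if gt_pairs is not None:
        return list(gt_pairs)
    return list(product(range(len(humans)), range(len(objects))))


def predict_verb(pair_embedding, prompt_embeddings):
    """Pick the verb whose (hypothetical) prompt embedding is most similar."""
    return max(
        prompt_embeddings,
        key=lambda verb: cosine(pair_embedding, prompt_embeddings[verb]),
    )


# Toy usage with made-up 2-D embeddings.
humans, objects = ["h0", "h1"], ["o0", "o1", "o2"]
all_pairs = candidate_pairs(humans, objects)
gt_only = candidate_pairs(humans, objects, gt_pairs=[(0, 1)])
verb = predict_verb([1.0, 0.0], {"ride": [0.9, 0.1], "hold": [0.1, 0.9]})
```

In this sketch no parameters are fitted, which is what makes it training-free; in practice the embeddings would come from a pretrained multimodal model rather than hand-written vectors.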
Taxonomy
Topics: Speech and dialogue systems · Geographic Information Systems Studies · Educational Tools and Methods
Methods: Softmax · Dense Connections · Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Attention Is All You Need · Vision Transformer · self-DIstillation with NO labels
