Open-World Human-Object Interaction Detection via Multi-modal Prompts

Jie Yang; Bingliang Li; Ailing Zeng; Lei Zhang; Ruimao Zhang

arXiv:2406.07221·cs.CV·June 12, 2024

Open Access

TL;DR

This paper introduces MP-HOI, a multi-modal prompt-based detector for open-world human-object interaction (HOI) detection. It leverages large-scale datasets and contrastive learning to improve generalization and zero-shot performance and to handle ambiguous interactions.

Contribution

The paper presents MP-HOI, a multi-modal prompt-based approach that adds visual prompts to language-guided HOI detectors and trains on large-scale datasets to improve open-world HOI detection.

Findings

01. Covers an HOI vocabulary more than 30 times larger than that of existing models.

02. Achieves state-of-the-art results across multiple benchmarks.

03. Exhibits strong zero-shot generalization capabilities.

Abstract

In this paper, we develop MP-HOI, a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling high ambiguity in descriptions, realizing HOI detection in the open world. Specifically, it integrates visual prompts into existing language-guided-only HOI detectors to handle situations where textual descriptions face difficulties in generalization and to address complex scenarios with high interaction ambiguity. To facilitate MP-HOI training, we build a large-scale HOI dataset named Magic-HOI, which gathers six existing datasets into a unified label space, forming over 186K images with 2.4K objects, 1.2K actions, and 20K HOI interactions. Furthermore, to tackle the long-tail issue within the Magic-HOI dataset, we introduce an automated pipeline for generating realistically annotated HOI…
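
The idea of gathering several datasets into a unified label space can be illustrated with a toy sketch. This is not the authors' pipeline: the synonym table, the dataset names, and the functions `normalize` and `unify` are all hypothetical, and the sketch only assumes that each dataset provides (action, object) annotation pairs whose spellings may differ between sources.

```python
from collections import Counter

# Hypothetical synonym table (an assumption for illustration): maps
# dataset-specific label spellings to one canonical name.
CANONICAL = {
    "riding": "ride",
    "bike": "bicycle",
}


def normalize(label):
    """Lower-case a label, replace underscores, and apply the synonym table."""
    key = label.strip().lower().replace("_", " ")
    return CANONICAL.get(key, key)


def unify(datasets):
    """Merge per-dataset (action, object) pairs into one label space.

    Returns a mapping from canonical pair to integer id, plus the number
    of annotations seen for each pair across all datasets.
    """
    counts = Counter()
    for annotations in datasets.values():
        for action, obj in annotations:
            counts[(normalize(action), normalize(obj))] += 1
    # Assign ids in sorted order so the mapping is reproducible.
    label_to_id = {pair: i for i, pair in enumerate(sorted(counts))}
    return label_to_id, counts


# Toy input: two hypothetical datasets with inconsistent label spellings.
toy = {
    "dataset_a": [("ride", "bicycle"), ("hold", "cup")],
    "dataset_b": [("Riding", "bike"), ("hold", "cup"), ("hold", "cup")],
}
label_to_id, counts = unify(toy)
```

In a merged label space of this kind, the per-pair counts also make class imbalance visible, which is the sort of long-tail distribution the abstract mentions.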

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Speech and dialogue systems