Open-World Human-Object Interaction Detection via Multi-modal Prompts
Jie Yang, Bingliang Li, Ailing Zeng, Lei Zhang, Ruimao Zhang

TL;DR
This paper introduces MP-HOI, a multi-modal prompt-based detector for open-world human-object interaction (HOI) detection. It leverages large-scale datasets and contrastive learning to improve generalization and zero-shot performance and to handle ambiguous interactions.
Contribution
The paper presents MP-HOI, a novel multi-modal prompt-based approach that integrates visual prompts and large-scale datasets for improved open-world HOI detection.
Findings
Covers an HOI vocabulary more than 30 times larger than that of existing models.
Achieves state-of-the-art results across multiple benchmarks.
Exhibits strong zero-shot generalization capabilities.
Abstract
In this paper, we develop MP-HOI, a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling high ambiguity in descriptions, realizing HOI detection in the open world. Specifically, it integrates visual prompts into existing language-guided-only HOI detectors to handle situations where textual descriptions face difficulties in generalization and to address complex scenarios with high interaction ambiguity. To facilitate MP-HOI training, we build a large-scale HOI dataset named Magic-HOI, which gathers six existing datasets into a unified label space, forming over 186K images with 2.4K objects, 1.2K actions, and 20K HOI interactions. Furthermore, to tackle the long-tail issue within the Magic-HOI dataset, we introduce an automated pipeline for generating realistically annotated HOI…
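The abstract mentions gathering six datasets into a unified label space. As a rough illustration of what that step can involve, the sketch below merges (action, object) label pairs from several sources into one shared vocabulary. It is not the authors' pipeline: the function names, the normalization rule, and the toy dataset names are all assumptions made for this example, and a real merge would likely also need synonym resolution and manual review.

```python
from collections import defaultdict


def normalize(label: str) -> str:
    """Lowercase and collapse separators so 'Sports_Ball' matches 'sports ball'."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def unify_label_space(datasets):
    """Assign one shared integer id to each distinct (action, object) pair.

    `datasets` maps a dataset name to a list of (action, object) string pairs.
    Returns (vocab, sources):
      vocab   maps a normalized (action, object) pair to its unified id
      sources maps a unified id to the set of dataset names that contain it
    """
    vocab = {}
    sources = defaultdict(set)
    for name, pairs in datasets.items():
        for action, obj in pairs:
            key = (normalize(action), normalize(obj))
            if key not in vocab:
                vocab[key] = len(vocab)  # next free id
            sources[vocab[key]].add(name)
    return vocab, dict(sources)


# Toy, made-up inputs (hypothetical dataset names and labels).
toy = {
    "dataset_a": [("ride", "bicycle"), ("hold", "cup")],
    "dataset_b": [("Ride", "Bicycle"), ("kick", "sports_ball")],
}
vocab, sources = unify_label_space(toy)
```

Here the two spellings of "ride bicycle" collapse to a single id, and `sources` records which datasets contribute each interaction, which is the kind of bookkeeping that makes long-tail categories visible.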
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Speech and dialogue systems
