Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models
Liulei Li, Wenguan Wang, Yi Yang

TL;DR
This paper introduces DIFFUSIONHOI, a novel human-object interaction detection method leveraging diffusion models' strengths in visual concept recognition and compositionality, achieving state-of-the-art results without heavy fine-tuning.
Contribution
The paper proposes a new HOI detection approach built on diffusion models. It uses an inversion-based strategy to learn relation embeddings and targets both the regular and zero-shot settings.
Findings
Achieves state-of-the-art results on three datasets.
Effective in zero-shot HOI detection scenarios.
Utilizes diffusion models to better capture mid/low-level visual cues.
Abstract
Prevalent human-object interaction (HOI) detection approaches typically leverage large-scale visual-linguistic models to help recognize events involving humans and objects. Though promising, models trained via contrastive learning on text-image pairs often neglect mid/low-level visual cues and struggle at compositional reasoning. In response, we introduce DIFFUSIONHOI, a new HOI detector shedding light on text-to-image diffusion models. Unlike the aforementioned models, diffusion models excel in discerning mid/low-level visual concepts as generative models, and possess strong compositionality to handle novel concepts expressed in text inputs. Considering diffusion models usually emphasize instance objects, we first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as…
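The abstract does not spell out the inversion objective, so the following is only a toy sketch of the general idea behind inversion-based embedding learning: keep the generative model frozen and optimize a single embedding against a noise-prediction loss. Everything here is an assumption made for illustration and is not taken from the paper: the linear "decoder" D, the closed-form toy denoiser, the function names decode, predict_noise and invert, and all hyperparameters.

```python
import math
import random

# Hypothetical frozen "decoder": maps a 2-d embedding to a 3-d toy "image".
# It stands in for the pretrained diffusion model and is never updated.
D = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]


def decode(c):
    """Map an embedding c to the toy image it describes."""
    return [sum(D[i][j] * c[j] for j in range(2)) for i in range(3)]


def predict_noise(x_t, c, alpha_bar):
    """Frozen toy denoiser: estimate the noise in x_t given conditioning c."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x0_hat = decode(c)
    return [(x_t[i] - a * x0_hat[i]) / s for i in range(3)]


def invert(x0, steps=2000, lr=0.02, seed=0):
    """Learn an embedding for x0 by gradient descent on the denoising loss."""
    rng = random.Random(seed)
    c = [0.0, 0.0]  # the embedding is the only trainable quantity
    for _ in range(steps):
        # Sample a noise level and noise, then form the noisy input x_t.
        alpha_bar = rng.uniform(0.2, 0.8)
        a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        eps = [rng.gauss(0.0, 1.0) for _ in range(3)]
        x_t = [a * x0[i] + s * eps[i] for i in range(3)]

        # Residual of the loss sum_i (eps_hat[i] - eps[i])^2.
        eps_hat = predict_noise(x_t, c, alpha_bar)
        r = [eps_hat[i] - eps[i] for i in range(3)]

        # Analytic gradient: d eps_hat[i] / d c[j] = -(a / s) * D[i][j].
        grad = [sum(2.0 * r[i] * (-(a / s)) * D[i][j] for i in range(3))
                for j in range(2)]
        c = [c[j] - lr * grad[j] for j in range(2)]
    return c


if __name__ == "__main__":
    c_star = [0.5, -1.0]        # hypothetical "true" relation embedding
    learned = invert(decode(c_star))
    print(learned)
```

In this toy setup the learned embedding should land close to c_star, because the frozen denoiser can only explain the observed noise once the embedding decodes to the target image. A real diffusion backbone replaces the closed-form denoiser with a pretrained network and the analytic gradient with automatic differentiation.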
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Video Surveillance and Tracking Methods
Methods: Diffusion · Contrastive Learning
