Spatial-Temporal Human-Object Interaction Detection
Xu Sun, Yunqing He, Tongwei Ren, Gangshan Wu

TL;DR
This paper introduces ST-HOID, a new task for detecting fine-grained human-object interactions and the trajectories of their subjects and objects in videos, along with a method for the task and a dataset for evaluating it.
Contribution
The paper constructs VidOR-HOID, the first dataset for ST-HOID evaluation, and proposes a method that combines an object trajectory detection module with an interaction reasoning module.
Findings
The proposed method outperforms baselines built from state-of-the-art methods for related tasks, including image human-object interaction detection and video visual relation detection.
The VidOR-HOID dataset contains 10,831 spatial-temporal HOI instances for ST-HOID evaluation.
Abstract
In this paper, we propose a new instance-level human-object interaction detection task on videos called ST-HOID, which aims to distinguish fine-grained human-object interactions (HOIs) and the trajectories of subjects and objects. It is motivated by the fact that HOI is crucial for human-centric video content understanding. To solve ST-HOID, we propose a novel method consisting of an object trajectory detection module and an interaction reasoning module. Furthermore, we construct the first dataset named VidOR-HOID for ST-HOID evaluation, which contains 10,831 spatial-temporal HOI instances. We conduct extensive experiments to evaluate the effectiveness of our method. The experimental results demonstrate that our method outperforms the baselines generated by the state-of-the-art methods of image human-object interaction detection, video visual relation detection and video human-object…
