Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection
Hongsheng Li, Guangming Zhu, Wu Zhen, Lan Ni, Peiyi Shen, Liang Zhang, Ning Wang, Cong Hua

TL;DR
This paper introduces SPDTP, a novel spatio-temporal graph network with explicit spatial parsing and a learnable temporal module, significantly improving video human-object interaction detection performance.
Contribution
The paper proposes a new SPDTP network with explicit spatial parsing and a dynamic temporal module for better video HOI detection, achieving state-of-the-art results.
Findings
SPDTP outperforms existing methods on the CAD-120 and Something-Else datasets.
Explicit spatial parsing improves interaction discrimination.
Dynamic temporal pooling emphasizes keyframes and reduces redundancy.
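To make the keyframe idea concrete, here is a minimal sketch of score-weighted temporal pooling, shown next to the mean pooling it reduces to when all frames score equally. It is an illustration only and not the SPDTP module: the summary does not describe how the paper computes its pooling, so the function names (`softmax`, `weighted_temporal_pool`), the softmax weighting, and the per-frame scores passed in by hand are all assumptions. In a learned module the scores would come from the network rather than from the caller.

```python
import math


def softmax(scores):
    """Turn arbitrary per-frame scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_temporal_pool(frame_features, frame_scores):
    """Pool per-frame feature vectors into one vector.

    frame_features: list of equal-length lists, one per frame.
    frame_scores:   one score per frame (hypothetical; assumed to be
                    produced elsewhere, e.g. by a learned scoring layer).
    Returns (pooled_vector, weights).
    """
    if not frame_features or len(frame_features) != len(frame_scores):
        raise ValueError("need one score per frame and at least one frame")
    weights = softmax(frame_scores)
    pooled = [0.0] * len(frame_features[0])
    for weight, feature in zip(weights, frame_features):
        for i, value in enumerate(feature):
            pooled[i] += weight * value
    return pooled, weights


# Two frames with 2-D features. Equal scores give plain mean pooling;
# a much higher score on the second frame makes it dominate the result.
features = [[1.0, 0.0], [3.0, 4.0]]
mean_like, uniform_weights = weighted_temporal_pool(features, [0.0, 0.0])
keyframe_like, peaked_weights = weighted_temporal_pool(features, [0.0, 10.0])
```

With equal scores every frame contributes the same amount, so redundant frames dilute the informative ones; raising the score of a frame shifts the pooled vector toward that frame's features.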
Abstract
The key to Human-Object Interaction (HOI) recognition is inferring the relationship between humans and objects. Recently, image-based HOI detection has made significant progress. However, there is still room for improvement in video HOI detection performance. Existing one-stage methods use well-designed end-to-end networks to process a video segment and directly predict an interaction. This makes model learning and further optimization of the network more complex. This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as a spatio-temporal graph with human and object nodes as input. Unlike existing methods, our proposed network predicts the difference between interactive and non-interactive pairs through explicit spatial parsing, and then performs interaction recognition. Moreover, we propose a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Visual Attention and Saliency Detection
