Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection

Hongsheng Li; Guangming Zhu; Wu Zhen; Lan Ni; Peiyi Shen; Liang Zhang; Ning Wang; Cong Hua

arXiv:2206.03061 · cs.CV · June 8, 2022 · 1 citation

TL;DR

This paper introduces SPDTP, a novel spatio-temporal graph network with explicit spatial parsing and a learnable temporal module, significantly improving video human-object interaction detection performance.

Contribution

The paper proposes a new SPDTP network with explicit spatial parsing and a dynamic temporal module for better video HOI detection, achieving state-of-the-art results.

Findings

01. SPDTP outperforms existing methods on CAD-120 and Something-Else datasets.

02. Explicit spatial parsing improves interaction discrimination.

03. Dynamic temporal pooling emphasizes keyframes and reduces redundancy.
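
To make the keyframe idea in the last finding concrete, here is a minimal sketch of softmax-weighted temporal pooling, one common way to let some frames count more than others. It is not the paper's module, whose details are not given on this page. The function names `softmax` and `weighted_temporal_pool` are hypothetical, and the per-frame scores are set by hand here, whereas a learnable module would predict them.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_temporal_pool(frame_features, frame_scores):
    """Pool T per-frame feature vectors into one vector.

    frame_features: list of T equal-length lists of floats.
    frame_scores: list of T floats (assumed to come from a learned scorer;
    hand-set in this sketch). Higher score means more weight.
    Returns (pooled_vector, weights).
    """
    if not frame_features or len(frame_features) != len(frame_scores):
        raise ValueError("need one score per frame and at least one frame")
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    pooled = [0.0] * dim
    for w, feat in zip(weights, frame_features):
        for i in range(dim):
            pooled[i] += w * feat[i]
    return pooled, weights


if __name__ == "__main__":
    # Three frames with 2-D features; the middle frame is scored as the keyframe.
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    pooled, weights = weighted_temporal_pool(feats, [0.0, 4.0, 0.0])
    print(weights, pooled)
```

With equal scores this reduces to plain average pooling; as one score grows, the pooled vector approaches that frame's features, which is the sense in which redundant frames are down-weighted.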

Abstract

The key to Human-Object Interaction (HOI) recognition is inferring the relationships between humans and objects. HOI detection in images has recently made significant progress, but video HOI detection still leaves room for improvement. Existing one-stage methods use carefully designed end-to-end networks that take a video segment and directly predict an interaction, which makes model learning and further optimization of the network more complex. This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as input in the form of a spatio-temporal graph with human and object nodes. Unlike existing methods, our proposed network predicts the difference between interactive and non-interactive pairs through explicit spatial parsing, and then performs interaction recognition. Moreover, we propose a…
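
The abstract's input representation, a spatio-temporal graph with human and object nodes, can be sketched as a plain data structure. This is an assumption-laden illustration rather than the authors' construction: it assumes one human and a fixed number of objects per frame, spatial edges between the human and every object in the same frame, and temporal edges linking each entity to itself in the next frame. The function name `build_spatio_temporal_graph` and the node tuple layout are hypothetical.

```python
def build_spatio_temporal_graph(num_frames, num_objects):
    """Build a toy spatio-temporal graph for a video.

    Nodes are tuples (frame_index, kind, entity_index), where kind is
    "human" or "object". Assumes one human per frame (illustrative only).
    Returns (nodes, spatial_edges, temporal_edges).
    """
    nodes = []
    spatial_edges = []   # human-object pairs within one frame
    temporal_edges = []  # same entity across consecutive frames
    for t in range(num_frames):
        human = (t, "human", 0)
        nodes.append(human)
        for k in range(num_objects):
            obj = (t, "object", k)
            nodes.append(obj)
            # Candidate pair that a spatial parsing step would label
            # as interactive or non-interactive.
            spatial_edges.append((human, obj))
        if t > 0:
            temporal_edges.append(((t - 1, "human", 0), human))
            for k in range(num_objects):
                temporal_edges.append(((t - 1, "object", k), (t, "object", k)))
    return nodes, spatial_edges, temporal_edges


if __name__ == "__main__":
    nodes, spatial, temporal = build_spatio_temporal_graph(num_frames=3, num_objects=2)
    print(len(nodes), len(spatial), len(temporal))
```

In this layout each spatial edge is one human-object pair, so a parsing step that separates interactive from non-interactive pairs would operate on `spatial_edges` before any interaction label is predicted.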

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Visual Attention and Saliency Detection