Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition
Ning Wang, Guangming Zhu, Liang Zhang, Peiyi Shen, Hongsheng Li, Cong Hua

TL;DR
This paper introduces STIGPN, a graph-based neural network that models spatio-temporal relationships in videos to improve human-object interaction recognition, leveraging multi-modal features and a multi-stream fusion strategy.
Contribution
The paper proposes a novel spatio-temporal interaction graph parsing network that captures position changes and long-range dependencies for better video-based interaction understanding.
Findings
Achieves state-of-the-art results on the CAD-120 and Something-Else datasets.
Effectively models inter-frame and intra-frame relationships.
Enhances recognition accuracy with multi-modal and multi-stream fusion.
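To make the intra-frame versus inter-frame distinction concrete, the following is a minimal sketch of one way a spatio-temporal graph over a video could be laid out. It is not the paper's implementation: the function name `build_st_adjacency`, the `temporal_window` parameter, the convention that entity 0 is the human, and the choice to link only human-object pairs within a frame are all assumptions made for illustration.

```python
def build_st_adjacency(num_frames, num_entities, temporal_window=1):
    """Return a dense 0/1 adjacency matrix over (frame, entity) nodes.

    Hypothetical layout: node index = frame * num_entities + entity,
    and entity 0 is assumed to be the human.
    """
    n = num_frames * num_entities
    adj = [[0] * n for _ in range(n)]

    def node(t, e):
        return t * num_entities + e

    for t in range(num_frames):
        # Intra-frame (spatial) edges: link the human to every object in frame t.
        for e in range(1, num_entities):
            adj[node(t, 0)][node(t, e)] = 1
            adj[node(t, e)][node(t, 0)] = 1
        # Inter-frame (temporal) edges: link each entity to itself in nearby frames.
        for dt in range(1, temporal_window + 1):
            if t + dt < num_frames:
                for e in range(num_entities):
                    adj[node(t, e)][node(t + dt, e)] = 1
                    adj[node(t + dt, e)][node(t, e)] = 1
    return adj


# Three frames, one human plus two objects per frame.
adj = build_st_adjacency(num_frames=3, num_entities=3)
num_edges = sum(map(sum, adj)) // 2  # each undirected edge is stored twice
print(num_edges)
```

Raising `temporal_window` adds edges between frames that are further apart, which is one simple way to expose longer-range dependencies to a graph network operating on this adjacency.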
Abstract
For a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video. With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies. Capturing the position changes of humans and objects over the spatio-temporal dimension is even more critical when their appearance features do not change significantly over time. Making full use of appearance features, spatial location, and semantic information is also key to improving video-based Human-Object Interaction recognition performance. In this paper, Spatio-Temporal Interaction Graph Parsing Networks (STIGPN) are constructed, which encode the videos with a graph…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
