From Category to Scenery: An End-to-End Framework for Multi-Person Human-Object Interaction Recognition in Videos
Tanqiu Qiao, Ruochen Li, Frederick W. B. Li, Hubert P. H. Shum

TL;DR
This paper introduces CATS, an end-to-end framework that combines geometric and visual features in a graph-based model to improve multi-person human-object interaction recognition in videos, achieving state-of-the-art results on the MPHOI-72 and CAD-120 datasets.
Contribution
The novel CATS framework effectively integrates geometric and visual features through graph modeling, advancing the understanding of complex human-object interactions in videos.
Findings
Achieves state-of-the-art performance on the MPHOI-72 and CAD-120 datasets.
Effectively models dynamic relationships between humans and objects.
Connects category-level geometric-visual features to scene-level relationships through a scenery interactive graph.
Abstract
Video-based Human-Object Interaction (HOI) recognition explores the intricate dynamics between humans and objects, which are essential for a comprehensive understanding of human behavior and intentions. While previous work has made significant strides, effectively integrating geometric and visual features to model dynamic relationships between humans and objects in a graph framework remains a challenge. In this work, we propose a novel end-to-end category-to-scenery framework, CATS, which first generates geometric features for each category through a separate graph and then fuses them with the corresponding visual features. Subsequently, we construct a scenery interactive graph with these enhanced geometric-visual features as nodes to learn the relationships among human and object categories. This methodological advance facilitates a deeper, more structured comprehension of…
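The three stages named in the abstract (per-category geometric graphs, fusion with visual features, and a scenery graph over the fused nodes) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: every function name, the feature dimensions, the chain-shaped skeleton, fusion by concatenation, and the attention-style node update are assumptions chosen only to make the data flow concrete, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_adjacency(adj):
    """Add self-loops and row-normalize so each node averages over its neighbours."""
    adj = adj + np.eye(adj.shape[0])
    return adj / adj.sum(axis=1, keepdims=True)

def category_geometric_feature(points, adj, weight):
    """One graph-convolution step over a category's keypoints, mean-pooled to one vector."""
    hidden = np.tanh(normalize_adjacency(adj) @ points @ weight)
    return hidden.mean(axis=0)

def fuse(geometric, visual):
    """Fuse by concatenation (assumption: the abstract does not name the fusion operator)."""
    return np.concatenate([geometric, visual])

def scenery_graph_update(nodes):
    """Fully connected attention over entity nodes (assumed form of the scenery graph)."""
    scores = nodes @ nodes.T / np.sqrt(nodes.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn, attn @ nodes

# Hypothetical scene: two humans (5 keypoints each, chain skeleton) and one object (2 box corners).
human_adj = np.eye(5, k=1) + np.eye(5, k=-1)
object_adj = np.array([[0.0, 1.0], [1.0, 0.0]])
w_human = rng.normal(size=(2, 8))    # separate weights per category
w_object = rng.normal(size=(2, 8))

entities = [
    (rng.normal(size=(5, 2)), human_adj, w_human),
    (rng.normal(size=(5, 2)), human_adj, w_human),
    (rng.normal(size=(2, 2)), object_adj, w_object),
]

# Stage 1 and 2: geometric feature per entity, fused with a stand-in 16-d visual feature.
nodes = np.stack([
    fuse(category_geometric_feature(pts, adj, w), rng.normal(size=16))
    for pts, adj, w in entities
])

# Stage 3: relate all human and object nodes in one scene-level graph.
attn, updated = scenery_graph_update(nodes)
print(nodes.shape, attn.shape, updated.shape)
```

Each row of `attn` sums to 1 and indicates how strongly one entity attends to the others in this toy setup; a trained model would learn the weights and classify interactions from the updated node features.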
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications
