Discovering A Variety of Objects in Spatio-Temporal Human-Object Interactions
Yong-Lu Li, Hongwei Fan, Zuoyu Qiu, Yiming Dou, Liang Xu, Hao-Shu Fang, Peiyang Guo, Haisheng Su, Dongliang Wang, Wei Wu, Cewu Lu

TL;DR
This paper introduces a new benchmark for spatio-temporal human-object interaction detection with diverse objects, and proposes a Hierarchical Probe Network to improve object discovery using spatio-temporal cues, revealing current limitations.
Contribution
The paper presents a new AVA-based benchmark with 51 interactions and over 1,000 objects, and proposes a Hierarchical Probe Network for improved object discovery in videos.
Findings
The proposed Hierarchical Probe Network (HPN) outperforms existing methods in object discovery tasks.
Current detectors struggle with localizing diverse/unseen objects.
The benchmark reveals limitations of current vision systems in complex interactions.
Abstract
Spatio-temporal Human-Object Interaction (ST-HOI) detection aims at detecting HOIs from videos, which is crucial for activity understanding. In daily HOIs, humans often interact with a variety of objects, e.g., holding and touching dozens of household items while cleaning. However, existing whole body-object interaction video benchmarks usually provide limited object classes. Here, we introduce a new benchmark based on AVA: Discovering Interacted Objects (DIO), which includes 51 interactions and 1,000+ objects. Accordingly, an ST-HOI learning task is proposed that expects vision systems to track human actors, detect interactions, and simultaneously discover interacted objects. Even though today's detectors/trackers excel in object detection/tracking tasks, they perform unsatisfactorily when localizing diverse/unseen objects in DIO. This profoundly reveals the limitation of current vision systems and poses a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods · Advanced Neural Network Applications
