Discovering A Variety of Objects in Spatio-Temporal Human-Object Interactions

Yong-Lu Li; Hongwei Fan; Zuoyu Qiu; Yiming Dou; Liang Xu; Hao-Shu Fang; Peiyang Guo; Haisheng Su; Dongliang Wang; Wei Wu; Cewu Lu

arXiv:2211.07501·cs.CV·November 21, 2022

TL;DR

This paper introduces a new benchmark for spatio-temporal human-object interaction (ST-HOI) detection with diverse objects and proposes a Hierarchical Probe Network (HPN) that uses spatio-temporal cues to improve object discovery. The benchmark shows that current detectors and trackers struggle to localize diverse or unseen objects.

Contribution

The paper presents Discovering Interacted Objects (DIO), a new AVA-based benchmark with 51 interactions and over 1,000 objects, and proposes a Hierarchical Probe Network for improved object discovery in videos.

Findings

01 HPN outperforms existing methods in object discovery tasks.

02 Current detectors struggle with localizing diverse/unseen objects.

03 The benchmark reveals limitations of current vision systems in complex interactions.

Abstract

Spatio-temporal Human-Object Interaction (ST-HOI) detection aims at detecting HOIs from videos, which is crucial for activity understanding. In daily HOIs, humans often interact with a variety of objects, e.g., holding and touching dozens of household items in cleaning. However, existing whole body-object interaction video benchmarks usually provide limited object classes. Here, we introduce a new benchmark based on AVA: Discovering Interacted Objects (DIO), including 51 interactions and 1,000+ objects. Accordingly, an ST-HOI learning task is proposed that expects vision systems to track human actors, detect interactions, and simultaneously discover interacted objects. Even though today's detectors/trackers excel in object detection/tracking tasks, they perform unsatisfactorily at localizing diverse/unseen objects in DIO. This profoundly reveals the limitation of current vision systems and poses a…

Code & Models

Repositories

dirtyharrylyl/hake-ava
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods · Advanced Neural Network Applications