Spatio-Temporal Action Detection with Multi-Object Interaction
Huijuan Xu, Lizhi Yang, Stan Sclaroff, Kate Saenko, Trevor Darrell

TL;DR
This paper introduces a new dataset and an end-to-end model for spatio-temporal action detection involving multiple objects, addressing limitations of existing methods that focus on single-person actions.
Contribution
The paper presents a novel dataset with multi-object interaction annotations and an end-to-end model capable of detecting complex multi-object actions in videos.
Findings
Proposed a new multi-object interaction dataset.
Developed an end-to-end spatio-temporal detection model.
Achieved competitive results on the UCF101-24 benchmark.
Abstract
Spatio-temporal action detection in videos requires localizing the action both spatially and temporally in the form of an "action tube". Most current spatio-temporal action detection datasets (e.g., UCF101-24, AVA, DALY) are annotated with action tubes that contain a single person performing the action, so the predominant action detection models simply employ a person detection and tracking pipeline for localization. However, when the action is defined as an interaction between multiple objects, such methods may fail because each bounding box in the action tube contains multiple objects instead of one person. In this paper, we study the spatio-temporal action detection problem with multi-object interaction. We introduce a new dataset that is annotated with action tubes containing multi-object interactions. Moreover, we propose an end-to-end spatio-temporal action detection model that…
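To make the "action tube" idea concrete, here is a minimal Python sketch that represents a tube as one bounding box per frame and scores a predicted tube against a ground-truth tube. The names `Box`, `Tube`, `box_iou`, and `tube_iou` are hypothetical and do not come from the paper. The scoring rule (temporal IoU multiplied by the mean spatial IoU over shared frames) is a convention commonly used for tube evaluation and is assumed here; it is not confirmed to be the paper's exact metric.

```python
from typing import Dict, Tuple

# Hypothetical types for illustration only.
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Tube = Dict[int, Box]                    # frame index -> box for that frame


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_iou(pred: Tube, gt: Tube) -> float:
    """Assumed convention: temporal IoU times mean spatial IoU on shared frames."""
    shared = pred.keys() & gt.keys()
    if not shared:
        return 0.0
    temporal = len(shared) / len(pred.keys() | gt.keys())
    spatial = sum(box_iou(pred[f], gt[f]) for f in shared) / len(shared)
    return temporal * spatial


# Two tubes with identical boxes that overlap on only one of three frames.
pred = {0: (0.0, 0.0, 10.0, 10.0), 1: (0.0, 0.0, 10.0, 10.0)}
gt = {1: (0.0, 0.0, 10.0, 10.0), 2: (0.0, 0.0, 10.0, 10.0)}
score = tube_iou(pred, gt)
```

In the multi-object setting described in the abstract, each per-frame box would enclose all interacting objects rather than a single person. The tube representation stays the same, but a person detector alone may no longer produce the right box.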
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods
