Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes
Li Yuan, Yichen Zhou, Shuning Chang, Ziyuan Huang, Yunpeng Chen, Xuecheng Nie, Tao Wang, Jiashi Feng, Shuicheng Yan

TL;DR
This paper advances person-level action recognition in crowded videos by integrating scene information and new diverse data, significantly improving accuracy and generalization in complex environments.
Contribution
It introduces a top-down approach combining strong human detection, semantic scene segmentation, and new data collection to enhance recognition in crowded scenes.
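The top-down idea described above (detect people first, then classify each person's action) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `detect_people` and `classify_clip` are hypothetical callables standing in for the human detector and the action recognition model, frames are toy nested lists, and using the middle frame's boxes for the whole clip is an assumption made for simplicity. The scene segmentation branch is omitted because the summary does not say how it is fused.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
Frame = List[List[int]]          # toy grayscale frame: rows of pixel values


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the region covered by `box` out of a single frame."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]


def recognize_actions(
    clip: Sequence[Frame],
    detect_people: Callable[[Frame], List[Box]],       # hypothetical detector
    classify_clip: Callable[[Sequence[Frame]], str],   # hypothetical action model
) -> List[Tuple[Box, str]]:
    """Detect people on the middle frame, then label each person's cropped clip."""
    key_frame = clip[len(clip) // 2]
    results = []
    for box in detect_people(key_frame):
        # Assumption: the same box is reused for every frame of the clip.
        person_clip = [crop(frame, box) for frame in clip]
        results.append((box, classify_clip(person_clip)))
    return results
```

Each detected person gets an independent label, which is what makes the approach person-level: two people in the same frame can be assigned different actions.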
Findings
Achieved 26.05 wf_mAP on the HIE dataset.
Ranked 1st in ACM MM 2020 Human in Events challenge.
Enhanced model generalization with diverse internet data.
Abstract
Detecting and recognizing human actions in videos of crowded scenes is challenging because of the complex environments and diverse events. Prior works fall short in two respects: (1) they make little use of scene information; (2) they lack training data for crowded and complex scenes. In this paper, we focus on improving spatio-temporal action recognition by fully utilizing scene information and collecting new data. A top-down strategy is used to overcome these limitations. Specifically, we adopt a strong human detector to locate each person in every frame. We then apply action recognition models to learn spatio-temporal information from video frames on both the HIE dataset and new data with diverse scenes from the internet, which improves the generalization ability of our model. In addition, the scene information is extracted by the…
