Towards Good Practices for Very Deep Two-Stream ConvNets
Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao

TL;DR
This paper introduces very deep two-stream convolutional networks for action recognition in videos, combining a set of good training practices with multi-GPU training to improve accuracy on the UCF101 dataset.
Contribution
The paper adapts recent very deep architectures (e.g. VGGNet, GoogLeNet) to video action recognition and proposes training practices that limit over-fitting on small datasets.
Findings
Achieved 91.4% accuracy on the UCF101 dataset.
Demonstrated the effectiveness of deep architectures with good training practices.
Extended Caffe for efficient multi-GPU training.
Abstract
Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First, the current network architectures (e.g. Two-stream ConvNets) are relatively shallow compared with those very deep models in the image domain (e.g. VGGNet, GoogLeNet), and therefore their modeling capacity is constrained by their depth. Second, probably more importantly, the training dataset of action recognition is extremely small compared with the ImageNet dataset, and thus it will be easy to over-fit on the training dataset. To address these issues, this report presents very deep two-stream ConvNets for action recognition, by adapting recent very deep architectures into the video domain. However, this…
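Illustrative sketch
The abstract above is cut off before it describes how the two streams are combined. As a rough illustration only, two-stream models are commonly fused late: a spatial network (RGB frames) and a temporal network (optical flow) each produce class scores, and the softmax outputs are averaged with per-stream weights. The function names, the default weights, and the toy scores below are assumptions made for this sketch, not values taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(spatial_logits, temporal_logits,
                    w_spatial=1.0, w_temporal=1.0):
    """Weighted average of the two streams' class probabilities.

    The weights are hypothetical defaults chosen for illustration.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    norm = w_spatial + w_temporal
    return [(w_spatial * s + w_temporal * t) / norm
            for s, t in zip(p_spatial, p_temporal)]


def predict(spatial_logits, temporal_logits, **weights):
    """Index of the class with the highest fused probability."""
    fused = fuse_two_stream(spatial_logits, temporal_logits, **weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy example with three classes: the spatial stream prefers class 0,
# the temporal stream prefers class 1 more strongly.
label = predict([2.0, 0.0, 0.0], [0.0, 3.0, 0.0])
```

With equal weights the more confident stream decides the toy example; raising `w_spatial` or `w_temporal` shifts the balance toward that stream.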
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Neural Network Applications
