Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?
Hirokatsu Kataoka, Tenga Wakamiya, Kensho Hara, Yutaka Satoh

TL;DR
This study explores how large-scale, carefully annotated video datasets and dataset combination strategies can enhance the performance of spatiotemporal 3D CNNs in video recognition tasks, emphasizing dataset construction over architecture alone.
Contribution
The paper demonstrates that large-scale, well-annotated datasets like Kinetics-700 improve 3D CNN performance and provides insights into dataset composition and merging strategies for better recognition accuracy.
Findings
Large-scale datasets improve 3D CNN accuracy.
Merging datasets like Kinetics-700 and Moments in Time enhances performance.
200-layer ResNet models benefit from larger pre-training datasets.
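The dataset-merging finding can be pictured with a minimal sketch. It assumes merging means concatenating the samples of both datasets and giving every source class its own index in a joint label space. The summary above does not say how overlapping class names are handled, so keeping them separate is an assumption here. The function `merge_datasets`, the file paths, and the class names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (video_path, class_name); toy values below


def merge_datasets(
    datasets: Dict[str, List[Sample]],
) -> Tuple[List[Tuple[str, int]], Dict[str, int]]:
    """Concatenate datasets into one list with a joint label space.

    Assumption: classes from different sources stay distinct even when
    their names match, so each key is "<dataset>/<class_name>".
    """
    class_to_index: Dict[str, int] = {}
    merged: List[Tuple[str, int]] = []
    # Iterate in sorted order so the index assignment is deterministic.
    for name in sorted(datasets):
        for path, label in datasets[name]:
            key = f"{name}/{label}"
            if key not in class_to_index:
                class_to_index[key] = len(class_to_index)
            merged.append((path, class_to_index[key]))
    return merged, class_to_index


# Hypothetical toy samples standing in for the two real datasets.
kinetics = [("k/0001.mp4", "juggling"), ("k/0002.mp4", "running")]
moments = [("m/0001.mp4", "running"), ("m/0002.mp4", "opening")]

merged, class_to_index = merge_datasets(
    {"kinetics700": kinetics, "moments_in_time": moments}
)
num_classes = len(class_to_index)  # size of the joint classifier head
```

In this sketch the pre-training classifier would have `num_classes` outputs. A variant that unifies identically named classes would instead key on the class name alone.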
Abstract
How can we collect and use a video dataset to further improve spatiotemporal 3D Convolutional Neural Networks (3D CNNs)? To answer this open question in video recognition positively, we conducted an exploratory study using several large-scale video datasets and 3D CNNs. In the early era of deep neural networks, 2D CNNs were better than 3D CNNs in the context of video recognition. Recent studies revealed that 3D CNNs can outperform 2D CNNs when trained on a large-scale video dataset. However, the field has relied heavily on architecture exploration rather than dataset consideration. Therefore, in the present paper, we conduct an exploratory study to improve spatiotemporal 3D CNNs as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy. We reveal that a carefully annotated dataset (e.g.,…
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Neural Network Applications
