Selective Volume Mixup for Video Action Recognition
Yi Tan, Zhaofan Qiu, Yanbin Hao, Ting Yao, Tao Mei

TL;DR
This paper introduces Selective Volume Mixup (SV-Mix), a novel video augmentation technique that adaptively combines informative video volumes to enhance model generalization on small datasets.
Contribution
The paper proposes a learnable selective augmentation strategy with spatial and temporal modules, optimized jointly with recognition models to improve video classification performance.
Findings
SV-Mix improves accuracy on multiple benchmarks.
It benefits both CNN and transformer models.
It enhances generalization on small datasets.
Abstract
Recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have convincingly demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often overfit on small-scale datasets with a limited number of training videos. A common solution is to apply existing image augmentation strategies, including Mixup, CutMix, and RandAugment, to each frame individually, although these strategies are not specifically optimized for video data. In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models with limited training videos. SV-Mix devises a learnable selective module to choose the most informative volumes from two videos and mixes the volumes to produce a new training video. Technically, we propose two new modules,…
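The mixing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the learnable selective modules are not specified in this summary, so a random binary mask over coarse spatio-temporal volumes stands in for them. The function names, the (T, H, W, C) layout, the patch size, and the choice to weight labels by the mean of the mask (a convention borrowed from CutMix) are all assumptions made for illustration.

```python
import numpy as np


def random_volume_mask(shape, patch, rng):
    """Hypothetical stand-in for a learned selector.

    Picks coarse (pt, ph, pw) volumes at random and returns a binary
    mask of shape (T, H, W, 1). T, H, W must be divisible by the patch.
    """
    t, h, w = shape
    pt, ph, pw = patch
    coarse = rng.integers(0, 2, size=(t // pt, h // ph, w // pw))
    coarse = coarse.astype(np.float32)
    # Expand each coarse cell back to full resolution.
    mask = coarse.repeat(pt, axis=0).repeat(ph, axis=1).repeat(pw, axis=2)
    return mask[..., None]  # trailing axis broadcasts over channels


def mix_volumes(video_a, video_b, mask):
    """Blend two (T, H, W, C) videos: mask=1 keeps A, mask=0 keeps B.

    Returns the mixed video and lam, the fraction taken from video A
    (assumed here to be the label-mixing weight).
    """
    mask = np.broadcast_to(mask, video_a.shape)
    mixed = mask * video_a + (1.0 - mask) * video_b
    lam = float(mask.mean())
    return mixed, lam


rng = np.random.default_rng(0)
video_a = np.ones((8, 32, 32, 3))   # toy clip: all ones
video_b = np.zeros((8, 32, 32, 3))  # toy clip: all zeros
mask = random_volume_mask((8, 32, 32), patch=(2, 16, 16), rng=rng)
mixed, lam = mix_volumes(video_a, video_b, mask)

# One-hot labels mixed with the same weight (assumed convention).
label_a = np.array([1.0, 0.0])
label_b = np.array([0.0, 1.0])
mixed_label = lam * label_a + (1.0 - lam) * label_b
```

In a learned variant, the random mask would be replaced by the output of a selection network trained jointly with the recognition model, as the Contribution section describes.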
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Stroke Rehabilitation and Recovery
Methods: Mixup · RandAugment
