MVFNet: Multi-View Fusion Network for Efficient Video Recognition
Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding

TL;DR
MVFNet introduces a multi-view fusion approach that uses 2D CNNs to efficiently model video dynamics from multiple planes, achieving state-of-the-art results on action recognition benchmarks.
Contribution
The paper proposes a novel multi-view fusion module for 2D CNNs that captures video dynamics from multiple planes, enhancing efficiency and effectiveness in video recognition.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Maintains low complexity comparable to 2D CNNs.
Generalizes several existing video modeling methods.
Abstract
Conventionally, spatiotemporal modeling networks and their complexity are the two most studied research topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness. First, besides the traditional treatment of H x W x T video frames as a space-time signal (viewed from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture video dynamics thoroughly. Second, our model is built on 2D CNN backbones, with model complexity kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using…
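The idea of viewing one clip from the Height-Width, Height-Time, and Width-Time planes can be illustrated with a small NumPy sketch. This is not the paper's MVF module: the function names (`plane_views`, `box_filter_2d`, `multi_view_fuse`) are hypothetical, a fixed mean filter stands in for learned convolutions, the clip is single-channel, and fusion by summation is an assumption made only for illustration.

```python
import numpy as np

def plane_views(video):
    """Re-slice a (T, H, W) clip so each stack holds 2D planes of one view."""
    return {
        "HW": video,                     # T slices, each an H x W frame
        "HT": video.transpose(2, 1, 0),  # W slices, each an H x T plane
        "WT": video.transpose(1, 2, 0),  # H slices, each a W x T plane
    }

def box_filter_2d(stack, k=3):
    """Apply a zero-padded k x k mean filter to every 2D slice of an (N, A, B) stack."""
    pad = k // 2
    padded = np.pad(stack, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(stack.shape, dtype=float)
    for i in range(k):
        for j in range(k):
            out += padded[:, i:i + stack.shape[1], j:j + stack.shape[2]]
    return out / (k * k)

def multi_view_fuse(video, k=3):
    """Filter the clip in each plane, map results back to (T, H, W), and sum them.

    Summation is an illustrative fusion choice, not taken from the source.
    """
    views = plane_views(video)
    hw = box_filter_2d(views["HW"], k)                     # already (T, H, W)
    ht = box_filter_2d(views["HT"], k).transpose(2, 1, 0)  # (W, H, T) -> (T, H, W)
    wt = box_filter_2d(views["WT"], k).transpose(2, 0, 1)  # (H, W, T) -> (T, H, W)
    return hw + ht + wt

video = np.ones((4, 5, 6))  # T=4 frames of size H=5, W=6
fused = multi_view_fuse(video)
```

Each view uses only 2D filtering over a re-sliced volume, which is the sense in which 2D operations can still see temporal structure: the Height-Time and Width-Time planes put the time axis inside the filter's receptive field.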
Taxonomy
Topics: Human Pose and Action Recognition · Diabetic Foot Ulcer Assessment and Management · Anomaly Detection Techniques and Applications
Methods: Convolution
