Fusing Multi-Stream Deep Networks for Video Classification
Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, Xiangyang Xue, Jun Wang

TL;DR
This paper introduces a multi-stream deep learning framework for video classification that combines spatial, motion, and audio features with adaptive fusion, achieving state-of-the-art results on benchmark datasets.
Contribution
The paper presents a novel multi-stream architecture with adaptive fusion that leverages multimodal features and class relationships for improved video classification.
Findings
Achieved 92.2% accuracy on UCF-101 without audio.
Achieved 84.9% accuracy on Columbia Consumer Videos (CCV).
Outperformed existing methods on both benchmark datasets.
Abstract
This paper studies deep network architectures to address the problem of video classification. A multi-stream framework is proposed to fully utilize the rich multimodal information in videos. Specifically, we first train three Convolutional Neural Networks to model spatial, short-term motion and audio clues respectively. Long Short Term Memory networks are then adopted to explore long-term temporal dynamics. With the outputs of the individual streams, we propose a simple and effective fusion method to generate the final predictions, where the optimal fusion weights are learned adaptively for each class, and the learning process is regularized by automatically estimated class relationships. Our contributions are two-fold. First, the proposed multi-stream framework is able to exploit multimodal features that are more comprehensive than those previously attempted. Second, we demonstrate…
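The class-wise fusion step described in the abstract can be illustrated with a minimal sketch. It assumes each stream has already produced per-class scores and learns one weight per (stream, class) pair. The function names `fuse_scores` and `fit_class_weights` are hypothetical, and the plain ridge penalty used here is only a stand-in: the paper regularizes the weights with automatically estimated class relationships, which this sketch does not reproduce.

```python
import numpy as np


def fuse_scores(stream_scores, weights):
    """Fuse per-stream scores with class-specific weights.

    stream_scores: array of shape (M, N, C) for M streams, N videos, C classes.
    weights:       array of shape (M, C), one weight per stream and class.
    Returns fused scores of shape (N, C).
    """
    return np.einsum("mnc,mc->nc", stream_scores, weights)


def fit_class_weights(stream_scores, labels, ridge=1e-3):
    """Fit per-class fusion weights by ridge-regularized least squares.

    Assumption: a simple ridge penalty replaces the class-relationship
    regularizer described in the paper.
    """
    num_streams, _, num_classes = stream_scores.shape
    targets = np.eye(num_classes)[labels]          # one-hot targets, (N, C)
    weights = np.zeros((num_streams, num_classes))
    for c in range(num_classes):
        x = stream_scores[:, :, c].T               # (N, M) scores for class c
        gram = x.T @ x + ridge * np.eye(num_streams)
        weights[:, c] = np.linalg.solve(gram, x.T @ targets[:, c])
    return weights
```

With scores from, say, spatial, motion, and audio streams stacked along the first axis, `fit_class_weights` would be run on held-out training scores and `fuse_scores(...).argmax(axis=1)` would give the predicted class per video.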
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Digital Media Forensic Detection
