Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification
Yu-Gang Jiang, Zuxuan Wu, Jinhui Tang, Zechao Li, Xiangyang Xue, Shih-Fu Chang

TL;DR
This paper presents a hybrid deep learning framework that integrates multimodal cues from appearance, motion, and audio, along with temporal and semantic context modeling, to significantly improve video classification accuracy.
Contribution
It introduces a comprehensive multimodal deep learning framework combining CNNs, LSTMs, feature fusion, and semantic refinement for enhanced video categorization.
Findings
Achieves 93.1% accuracy on UCF-101
Achieves 84.5% accuracy on CCV
Demonstrates the effectiveness of multimodal and contextual modeling
Abstract
Videos are inherently multimodal. This paper studies the problem of how to fully exploit the abundant multimodal clues for improved video categorization. We introduce a hybrid deep learning framework that integrates useful clues from multiple modalities, including static spatial appearance information, motion patterns within a short time window, audio information as well as long-range temporal dynamics. More specifically, we utilize three Convolutional Neural Networks (CNNs) operating on appearance, motion and audio signals to extract their corresponding features. We then employ a feature fusion network to derive a unified representation with an aim to capture the relationships among features. Furthermore, to exploit the long-range temporal dynamics in videos, we apply two Long Short Term Memory networks with extracted appearance and motion features as inputs. Finally, we also propose…
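The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the CNN feature extractors are replaced by random vectors, the fusion network is assumed to be a single dense layer over concatenated, mean-pooled features, and every name, dimension, and weight initialization (`fuse`, `run_lstm`, `FEAT_DIM`, and so on) is hypothetical. Only the LSTM update itself follows the standard formulation with sigmoid gates and tanh candidates.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mean_pool(frames):
    """Average per-frame vectors into one video-level vector (an assumption)."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def fuse(features, W, b):
    """Assumed fusion: concatenate modality vectors, then one dense tanh layer."""
    x = [v for f in features for v in f]
    return [math.tanh(z + bias) for z, bias in zip(matvec(W, x), b)]


def make_lstm_params(input_dim, hidden_dim, rng):
    """Weights for the input (i), forget (f), output (o) and candidate (g) gates."""
    params = {}
    for gate in "ifog":
        params["W" + gate] = rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
        params["b" + gate] = [0.0] * hidden_dim
    return params


def gate(name, z, params, activation):
    pre = matvec(params["W" + name], z)
    return [activation(p + b) for p, b in zip(pre, params["b" + name])]


def lstm_step(x, h, c, params):
    """One standard LSTM update on input x with previous state (h, c)."""
    z = x + h  # concatenate the input and the previous hidden state
    i = gate("i", z, params, sigmoid)
    f = gate("f", z, params, sigmoid)
    o = gate("o", z, params, sigmoid)
    g = gate("g", z, params, math.tanh)
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def run_lstm(sequence, hidden_dim, params):
    """Run the LSTM over a frame sequence and return the last hidden state."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h


rng = random.Random(0)
FRAMES, FEAT_DIM, AUDIO_DIM, FUSED_DIM, HIDDEN_DIM = 5, 4, 3, 6, 4

# Stand-ins for CNN outputs: per-frame appearance and motion features,
# plus a single audio feature vector for the whole clip.
appearance = [[rng.random() for _ in range(FEAT_DIM)] for _ in range(FRAMES)]
motion = [[rng.random() for _ in range(FEAT_DIM)] for _ in range(FRAMES)]
audio = [rng.random() for _ in range(AUDIO_DIM)]

# Branch 1: fused representation from all three modalities.
W_fuse = rand_matrix(FUSED_DIM, 2 * FEAT_DIM + AUDIO_DIM, rng)
b_fuse = [0.0] * FUSED_DIM
fused = fuse([mean_pool(appearance), mean_pool(motion), audio], W_fuse, b_fuse)

# Branch 2: one LSTM each over the appearance and motion sequences.
h_appearance = run_lstm(appearance, HIDDEN_DIM,
                        make_lstm_params(FEAT_DIM, HIDDEN_DIM, rng))
h_motion = run_lstm(motion, HIDDEN_DIM,
                    make_lstm_params(FEAT_DIM, HIDDEN_DIM, rng))

print(len(fused), len(h_appearance), len(h_motion))
```

The sketch stops at the three intermediate representations (one fused vector and two LSTM states). How the paper combines them into a final prediction is not covered by the abstract excerpt above, so that step is left out.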
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Music and Audio Processing
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
