Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

Yu-Gang Jiang; Zuxuan Wu; Jinhui Tang; Zechao Li; Xiangyang Xue; Shih-Fu Chang

arXiv:1706.04508 · cs.MM · June 15, 2017 · 6 cites

Open Access

TL;DR

This paper presents a hybrid deep learning framework that integrates multimodal cues from appearance, motion, and audio, along with temporal and semantic context modeling, to significantly improve video classification accuracy.

Contribution

It introduces a comprehensive multimodal deep learning framework combining CNNs, LSTMs, feature fusion, and semantic refinement for enhanced video categorization.

Findings

01. Achieves 93.1% accuracy on UCF-101

02. Achieves 84.5% accuracy on CCV

03. Demonstrates the effectiveness of multimodal and contextual modeling

Abstract

Videos are inherently multimodal. This paper studies the problem of how to fully exploit the abundant multimodal clues for improved video categorization. We introduce a hybrid deep learning framework that integrates useful clues from multiple modalities, including static spatial appearance information, motion patterns within a short time window, audio information as well as long-range temporal dynamics. More specifically, we utilize three Convolutional Neural Networks (CNNs) operating on appearance, motion and audio signals to extract their corresponding features. We then employ a feature fusion network to derive a unified representation with an aim to capture the relationships among features. Furthermore, to exploit the long-range temporal dynamics in videos, we apply two Long Short Term Memory networks with extracted appearance and motion features as inputs. Finally, we also propose…
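
As a rough illustration of the fusion step described in the abstract, the sketch below concatenates three per-modality feature vectors (appearance, motion, audio) and projects them through a single fully connected layer with a sigmoid activation. This is only a minimal sketch under stated assumptions: the feature sizes, the single-layer design, the random weights, and the names `dense`, `init_layer`, and `fuse` are all hypothetical and are not taken from the paper, whose actual fusion network may differ.

```python
import math
import random


def sigmoid(x):
    """Logistic function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """Fully connected layer followed by a sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (a stand-in for trained parameters)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def fuse(appearance, motion, audio, layer):
    """Concatenate the three modality features and project them to one vector."""
    joint = appearance + motion + audio  # list concatenation
    weights, biases = layer
    return dense(joint, weights, biases)


rng = random.Random(0)

# Hypothetical feature sizes; features from real CNNs would be much larger.
appearance = [rng.random() for _ in range(8)]
motion = [rng.random() for _ in range(8)]
audio = [rng.random() for _ in range(4)]

layer = init_layer(n_in=20, n_out=6, rng=rng)
fused = fuse(appearance, motion, audio, layer)
print(len(fused))  # 6: one value per output unit of the fusion layer
```

In a trained system the weights would be learned jointly with a classifier on top of the fused vector; here they are random, so the output values carry no meaning beyond showing the data flow.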

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Music and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory