Exploiting Temporal Coherence for Multi-modal Video Categorization
Palash Goyal, Saurabh Sahu, Shalini Ghosh, Chul Lee

TL;DR
This paper introduces a temporal coherence-based regularization method for multimodal video categorization that improves model performance across architectures such as RNNs, NetVLAD, and Transformers.
Contribution
The paper proposes a new temporal coherence regularization technique applicable to multiple model types for enhanced multimodal video categorization.
Findings
Outperforms strong state-of-the-art baseline models on video categorization
Applies to different model architectures (RNN, NetVLAD, Transformer)
Improves accuracy in video content analysis
Abstract
Multimodal ML models can process data in multiple modalities (e.g., video, images, audio, text) and are useful for video content analysis in a variety of problems (e.g., object detection, scene understanding). In this paper, we focus on the problem of video categorization by using a multimodal approach. We have developed a novel temporal coherence-based regularization approach, which applies to different types of models (e.g., RNN, NetVLAD, Transformer). We demonstrate through experiments how our proposed multimodal video categorization models with temporal coherence outperform strong state-of-the-art baseline models.
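The abstract does not give the form of the regularizer, so the following is only a minimal sketch of one common way to express temporal coherence, not the authors' formulation. It assumes the penalty is the mean squared distance between embeddings of consecutive frames, added to the task loss with a weight. The function names, the `weight` parameter, and the choice of squared L2 distance are all hypothetical.

```python
import numpy as np


def temporal_coherence_penalty(frame_embeddings: np.ndarray) -> float:
    """Mean squared L2 distance between consecutive frame embeddings.

    Assumption (not from the paper): coherence is measured as smoothness
    of per-frame embeddings over time. Input shape is (T, D) for T frames.
    """
    if frame_embeddings.ndim != 2:
        raise ValueError("expected an array of shape (T, D)")
    if frame_embeddings.shape[0] < 2:
        # A single frame has no neighbours, so there is nothing to penalize.
        return 0.0
    # Difference between each frame and the one before it: shape (T-1, D).
    diffs = frame_embeddings[1:] - frame_embeddings[:-1]
    return float(np.mean(np.sum(diffs ** 2, axis=1)))


def regularized_loss(task_loss: float,
                     frame_embeddings: np.ndarray,
                     weight: float = 0.1) -> float:
    """Task loss plus the weighted coherence penalty (hypothetical weighting)."""
    return task_loss + weight * temporal_coherence_penalty(frame_embeddings)
```

Because a penalty of this kind depends only on per-frame embeddings, it could in principle be attached to any encoder that produces them, which is consistent with the abstract's claim that the approach applies to several model types.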
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
