Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization
Fa-Ting Hong, Jia-Chang Feng, Dan Xu, Ying Shan, Wei-Shi Zheng

TL;DR
This paper introduces CO2-Net, a novel cross-modal consensus network that enhances weakly supervised temporal action localization by using attention mechanisms and mutual learning to produce more representative features, achieving state-of-the-art results.
Contribution
The paper proposes a cross-modal consensus network with attention modules and mutual learning for improved feature calibration in weakly supervised temporal action localization.
Findings
Achieves state-of-the-art results on the THUMOS14 and ActivityNet1.2 datasets.
The cross-modal consensus module effectively filters task-irrelevant information.
Mutual learning between modules maintains prediction consistency.
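The mutual-learning finding above can be illustrated with a minimal sketch. It assumes each modality stream outputs one attention weight per video snippet and that consistency is encouraged with a mean-squared-error penalty. The function name `mutual_learning_loss`, the MSE choice, and the sample values are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def mutual_learning_loss(att_rgb, att_flow):
    """Symmetric consistency penalty between two per-snippet attention sequences.

    att_rgb, att_flow: arrays of shape (T,), one weight per snippet.
    Assumption: MSE is used as the penalty. In an autograd framework each
    stream would treat the other stream's (detached) weights as a fixed target.
    """
    att_rgb = np.asarray(att_rgb, dtype=float)
    att_flow = np.asarray(att_flow, dtype=float)
    diff = att_rgb - att_flow
    return float(np.mean(diff ** 2))

# Hypothetical attention weights for a 4-snippet video.
att_rgb = np.array([0.9, 0.8, 0.1, 0.2])
att_flow = np.array([0.7, 0.9, 0.3, 0.1])
loss = mutual_learning_loss(att_rgb, att_flow)
```

The loss is zero when both streams agree on every snippet and grows as their attention sequences diverge, which is the sense in which it keeps the two predictions consistent.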
Abstract
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video using only video-level categorical supervision. Previous works use both appearance and motion features, but do not exploit them properly, applying only simple concatenation or score-level fusion. In this work, we argue that the features extracted from a pretrained extractor, e.g., I3D, are not WS-TAL task-specific features, so feature re-calibration is needed to reduce task-irrelevant information redundancy. Therefore, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we mainly introduce two identical cross-modal consensus modules (CCM) that design a cross-modal attention mechanism to filter out the task-irrelevant information redundancy using the global information from the…
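The idea of re-calibrating one modality's features with a cross-modal attention gate can be sketched as follows. The abstract is truncated here, so the exact inputs and layers of the CCM are not given. This sketch assumes a channel-wise sigmoid gate built from a temporally pooled (global) descriptor of the modality being re-calibrated and per-snippet (local) features of the other modality. The function `cross_modal_recalibrate`, the linear projections, and all shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_recalibrate(main, aux, w_global, w_local):
    """Gate the channels of `main` using context from both modalities.

    main, aux: (T, D) snippet features (e.g. appearance and motion streams).
    w_global, w_local: (D, D) projection matrices (hypothetical parameters).
    """
    # Global context: average the main modality over time, then project.
    global_ctx = main.mean(axis=0, keepdims=True) @ w_global   # (1, D)
    # Local context: project each snippet of the other modality.
    local_ctx = aux @ w_local                                  # (T, D)
    # Channel-wise gate in (0, 1); broadcasting combines global and local cues.
    gate = sigmoid(global_ctx * local_ctx)                     # (T, D)
    # Down-weight channels the gate considers task-irrelevant.
    return gate * main

rng = np.random.default_rng(0)
T, D = 8, 16
rgb = rng.standard_normal((T, D))
flow = rng.standard_normal((T, D))

# Two structurally identical modules, one per modality, with separate weights.
w_rgb = [rng.standard_normal((D, D)) * 0.1 for _ in range(2)]
w_flow = [rng.standard_normal((D, D)) * 0.1 for _ in range(2)]
rgb_cal = cross_modal_recalibrate(rgb, flow, *w_rgb)
flow_cal = cross_modal_recalibrate(flow, rgb, *w_flow)
```

Because the gate lies strictly between 0 and 1, the re-calibrated features keep the original shape and can only be attenuated per channel, which is one simple way to realise "filtering" rather than replacing the pretrained features.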
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
