Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Fa-Ting Hong; Jia-Chang Feng; Dan Xu; Ying Shan; Wei-Shi Zheng

arXiv:2107.12589·cs.CV·July 28, 2021·1 cites

TL;DR

This paper introduces CO2-Net, a cross-modal consensus network for weakly supervised temporal action localization. It uses cross-modal attention and mutual learning to produce more representative features, and reports state-of-the-art results on THUMOS14 and ActivityNet1.2.

Contribution

The paper proposes a cross-modal consensus network with attention modules and mutual learning for improved feature calibration in weakly supervised temporal action localization.

Findings

01. Achieves state-of-the-art results on the THUMOS14 and ActivityNet1.2 datasets.

02. The cross-modal consensus module effectively filters task-irrelevant information.

03. Mutual learning between modules maintains prediction consistency.

Abstract

Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video using only video-level categorical supervision. Previous works use both appearance and motion features, but do not exploit them properly, applying only simple concatenation or score-level fusion. In this work, we argue that features extracted from a pretrained extractor, e.g., I3D, are not specific to the WS-TAL task, so feature re-calibration is needed to reduce task-irrelevant information redundancy. Therefore, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we mainly introduce two identical cross-modal consensus modules (CCM) that use a cross-modal attention mechanism to filter out the task-irrelevant information redundancy using the global information from the…
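
To make the contrast in the abstract concrete, the following is a minimal sketch of the two baseline fusion strategies it names (simple concatenation and score-level fusion), next to a toy form of cross-modal channel gating. This is not the authors' implementation. The function names, the equal fusion weight, the use of a temporal mean as "global information", and the sigmoid gate are all assumptions made for illustration, and the learned layers a real CCM would contain are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def concat_fusion(appearance, motion):
    """Baseline 1: concatenate per-snippet appearance and motion features."""
    return [a + m for a, m in zip(appearance, motion)]


def score_fusion(appearance_scores, motion_scores, weight=0.5):
    """Baseline 2: weighted average of per-snippet class scores.

    `weight` is a hypothetical mixing coefficient, not a value from the paper.
    """
    return [
        [weight * a + (1.0 - weight) * m for a, m in zip(a_row, m_row)]
        for a_row, m_row in zip(appearance_scores, motion_scores)
    ]


def cross_modal_gate(main, aux):
    """Hypothetical channel-wise re-calibration of `main` guided by `aux`.

    Assumption: global context is the temporal mean of each channel of the
    main modality, and the gate is a sigmoid of that context multiplied by
    the per-snippet auxiliary feature. No learned parameters are used.
    """
    num_snippets = len(main)
    dim = len(main[0])
    # Global context: one value per channel, averaged over all snippets.
    global_ctx = [sum(row[d] for row in main) / num_snippets for d in range(dim)]
    # Gate in (0, 1) for every snippet and channel.
    gates = [
        [sigmoid(global_ctx[d] * aux[t][d]) for d in range(dim)]
        for t in range(num_snippets)
    ]
    # Scale each main-modality feature by its gate.
    return [
        [gates[t][d] * main[t][d] for d in range(dim)]
        for t in range(num_snippets)
    ]


# Toy input: 2 snippets, 2 channels per modality.
appearance = [[1.0, 0.0], [3.0, 2.0]]
motion = [[0.0, 1.0], [0.0, -1.0]]

fused = concat_fusion(appearance, motion)
recalibrated = cross_modal_gate(appearance, motion)
```

Unlike the two baselines, the gating variant lets one modality decide how much of each channel of the other modality is passed on, which is the general idea behind filtering task-irrelevant information. Swapping the arguments gives the symmetric case with motion as the main modality.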

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications