CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization
Rui Xia, Dan Jiang, Quan Zhang, Ke Zhang, Chun Yuan

TL;DR
This paper introduces CLIP-AE, a novel unsupervised method for temporal action localization that leverages cross-view audio-visual information and visual language pre-training to improve boundary detection without requiring annotations.
Contribution
The paper proposes a CLIP-assisted cross-view audiovisual enhancement framework that combines visual language pre-training and audio perception for improved unsupervised temporal action localization.
Findings
Outperforms state-of-the-art methods on public datasets
Effectively utilizes audio cues for boundary detection
Reduces reliance on highly discriminative region focus
Abstract
Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio…
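The abstract excerpt above is cut off and gives no implementation details, so the following is only a minimal illustrative sketch, not the authors' method. It assumes three hypothetical per-snippet score sequences (classification-based, VLP-based, and audio-based), combines them with made-up weights, and thresholds the result into segments. It shows how a score that peaks only on the most discriminative snippet can yield a narrower segment than a fused score. Every name, weight, and number here is an assumption.

```python
from typing import List, Sequence, Tuple


def fuse_scores(
    cls_scores: Sequence[float],
    vlp_scores: Sequence[float],
    audio_scores: Sequence[float],
    weights: Tuple[float, float, float] = (0.4, 0.4, 0.2),  # hypothetical weights
) -> List[float]:
    """Weighted sum of three per-snippet score sequences of equal length."""
    if not (len(cls_scores) == len(vlp_scores) == len(audio_scores)):
        raise ValueError("all score sequences must have the same length")
    w_cls, w_vlp, w_aud = weights
    return [
        w_cls * c + w_vlp * v + w_aud * a
        for c, v, a in zip(cls_scores, vlp_scores, audio_scores)
    ]


def scores_to_segments(
    scores: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Group consecutive snippets with score >= threshold into [start, end) index pairs."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # a segment opens here
        elif s < threshold and start is not None:
            segments.append((start, i))  # the segment closes before snippet i
            start = None
    if start is not None:
        segments.append((start, len(scores)))  # segment runs to the end
    return segments


# Toy scores (invented for illustration): the classification-based score
# fires only on the single most discriminative snippet.
cls_scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1]
vlp_scores = [0.1, 0.6, 0.8, 0.7, 0.2, 0.1]
audio_scores = [0.0, 0.7, 0.7, 0.7, 0.1, 0.0]

fused = fuse_scores(cls_scores, vlp_scores, audio_scores)
print(scores_to_segments(cls_scores))  # [(2, 3)]
print(scores_to_segments(fused))       # [(2, 4)]
```

In this toy case the classification-only score localizes a single snippet, while the fused score extends the segment by one snippet. A real system would learn how to combine the cues instead of using fixed weights and a fixed threshold.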
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need · Focus
