CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization

Rui Xia; Dan Jiang; Quan Zhang; Ke Zhang; Chun Yuan

arXiv:2505.23524·cs.CV·June 6, 2025

Open Access

TL;DR

This paper introduces CLIP-AE, a novel unsupervised method for temporal action localization that leverages cross-view audio-visual information and visual language pre-training to improve boundary detection without requiring annotations.

Contribution

The paper proposes a CLIP-assisted cross-view audiovisual enhancement framework that combines visual language pre-training and audio perception for improved unsupervised temporal action localization.

Findings

01. Outperforms state-of-the-art methods on public datasets

02. Effectively utilizes audio cues for boundary detection

03. Reduces excessive focus on highly discriminative regions

Abstract

Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods rely heavily on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming to annotate. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) features from classification pre-training focus excessively on highly discriminative regions; 2) relying solely on the visual modality makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio…
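To make the task concrete, here is a minimal sketch of what temporal action localization produces: per-snippet action scores from two feature views are combined and thresholded into (start, end) segments. This is not the authors' method. The function names `fuse_scores` and `scores_to_segments`, the simple weighted average, the fixed threshold, and the one-second snippet length are all hypothetical choices made for illustration only.

```python
from typing import List, Tuple


def fuse_scores(view_a: List[float], view_b: List[float], weight: float = 0.5) -> List[float]:
    """Weighted average of per-snippet scores from two views (illustrative assumption)."""
    if len(view_a) != len(view_b):
        raise ValueError("both views must score the same number of snippets")
    return [weight * a + (1.0 - weight) * b for a, b in zip(view_a, view_b)]


def scores_to_segments(
    scores: List[float],
    threshold: float = 0.5,
    snippet_seconds: float = 1.0,
) -> List[Tuple[float, float]]:
    """Group consecutive snippets whose score meets the threshold into segments."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first snippet in the currently open segment
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a segment opens here
        elif score < threshold and start is not None:
            segments.append((start * snippet_seconds, i * snippet_seconds))
            start = None  # the segment closed just before snippet i
    if start is not None:
        # The video ended while a segment was still open.
        segments.append((start * snippet_seconds, len(scores) * snippet_seconds))
    return segments


# Hypothetical scores from two views of the same six-snippet video.
view_a = [0.1, 0.9, 0.9, 0.2, 0.1, 0.8]
view_b = [0.1, 0.3, 0.9, 0.6, 0.1, 0.8]
segments = scores_to_segments(fuse_scores(view_a, view_b))
```

In this toy input, the first view alone scores snippet 1 highly while the second view does not; averaging the two keeps that snippet inside the segment, which loosely mirrors the motivation for combining complementary views stated in the abstract.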

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Focus