Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization

Geuntaek Lim; Hyunwoo Kim; Joonsoo Kim; Yukyung Choi

arXiv:2408.05955·cs.CV·August 13, 2024

TL;DR

This paper introduces a probabilistic vision-language framework for weakly supervised temporal action localization (WTAL). It aligns human action knowledge and vision-language pre-training (VLP) knowledge in a joint space to improve detection accuracy.

Contribution

It proposes a novel probabilistic embedding space and contrastive learning methods to better capture fine-grained human motions and align action knowledge with VLP.
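
As a rough illustration of what contrastive learning over a probabilistic embedding space can look like, the sketch below represents each embedding as a diagonal Gaussian (a mean vector and a standard-deviation vector) and applies an InfoNCE-style loss over distances between distributions. Everything here is an assumption made for illustration: the squared 2-Wasserstein distance, the loss form, and the names `wasserstein2_sq` and `info_nce` are not taken from the paper, whose actual distance measure and objectives are not given in this summary.

```python
import math


def wasserstein2_sq(mean_a, std_a, mean_b, std_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For diagonal covariances this reduces to the squared distance between
    the means plus the squared distance between the standard deviations.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    std_term = sum((a - b) ** 2 for a, b in zip(std_a, std_b))
    return mean_term + std_term


def info_nce(anchor, candidates, positive_index, temperature=1.0):
    """InfoNCE-style loss with negative distribution distance as the logit.

    anchor and each candidate are (mean, std) pairs (hypothetical layout).
    """
    logits = [-wasserstein2_sq(*anchor, *c) / temperature for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[positive_index]


# Toy example: one video-snippet distribution and two text distributions.
snippet = ([1.0, 0.0], [0.1, 0.1])
texts = [
    ([1.0, 0.0], [0.1, 0.1]),  # matches the snippet
    ([0.0, 1.0], [0.5, 0.5]),  # different mean and larger uncertainty
]
loss_match = info_nce(snippet, texts, positive_index=0)
loss_mismatch = info_nce(snippet, texts, positive_index=1)
print(loss_match < loss_mismatch)
```

In this toy setup the loss is lower when the positive is the text distribution that coincides with the snippet distribution. Unlike a deterministic point embedding, the standard-deviation term lets two embeddings with the same mean still differ in how uncertain they are.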

Findings

01 Outperforms previous state-of-the-art methods

02 Significant improvement in localization accuracy

03 Effective alignment of action and VLP knowledge

Abstract

Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel…
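
To make the localization-by-classification setting concrete, here is a minimal sketch of the generic idea: a classifier trained with video-level labels yields per-snippet scores for a class, and action segments are recovered by thresholding those scores and merging consecutive snippets. This is a common baseline pattern shown for illustration only, not the method proposed in the paper. The function name, the threshold value, and the snippet duration are assumptions.

```python
def localize_by_classification(scores, threshold=0.5, seconds_per_snippet=1.0):
    """Turn per-snippet class scores into temporal action proposals.

    Consecutive snippets whose score exceeds the threshold are merged into
    one proposal. Returns a list of (start_sec, end_sec, mean_score) tuples.
    """
    proposals = []
    start = None
    # Append a sentinel score below any threshold so the last run is closed.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score > threshold and start is None:
            start = i  # a new run of confident snippets begins
        elif score <= threshold and start is not None:
            run = scores[start:i]
            proposals.append(
                (start * seconds_per_snippet,
                 i * seconds_per_snippet,
                 sum(run) / len(run))
            )
            start = None
    return proposals


# Hypothetical scores for one action class over seven one-second snippets.
class_scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.6, 0.1]
segments = localize_by_classification(class_scores)
for start_sec, end_sec, confidence in segments:
    print(f"{start_sec:.1f}s to {end_sec:.1f}s, score {confidence:.2f}")
```

Because the scores are trained only to classify the whole video, nothing forces them to be high over the full extent of an action or low just outside it, which is the task discrepancy the abstract describes.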

Code & Models

Repositories

sejong-rcv/pvlr (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems

Methods: Contrastive Learning