Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models

Quan Zhang; Jinwei Fang; Rui Yuan; Xi Tang; Yuxin Qi; Ke Zhang; Chun Yuan

arXiv:2411.08466 · cs.CV · June 10, 2025

Open Access

TL;DR

This paper introduces MLLM4WTAL, a learning paradigm that uses multimodal large language models (MLLMs) to supply semantic priors for weakly supervised temporal action localization (WTAL), targeting incomplete and over-complete localization results.

Contribution

The paper proposes a new learning framework that integrates MLLMs with WTAL methods, introducing modules for semantic matching and reconstruction to improve localization accuracy.

Findings

1. Significant performance improvements on WTAL benchmarks.
2. Effective handling of incomplete and over-complete localization results.
3. Versatile enhancement across different WTAL models.
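
The terms "incomplete" and "over-complete" in the findings can be made concrete with a small sketch. The Python below is an illustration only and is not taken from the paper: the helper names, the 0.9 threshold, and the coverage/precision definitions are assumptions chosen for clarity. A predicted segment is labeled incomplete when it lies inside the true action but misses part of it, and over-complete when it covers the action but also spills into surrounding background.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def overlap(pred: Segment, gt: Segment) -> float:
    """Length of the temporal intersection of two segments, in seconds."""
    return max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Temporal intersection-over-union of a predicted and a true segment."""
    inter = overlap(pred, gt)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def classify(pred: Segment, gt: Segment, thr: float = 0.9) -> str:
    """Label a prediction relative to one ground-truth action.

    The threshold `thr` and the four labels are illustrative assumptions.
    """
    pred_len = pred[1] - pred[0]
    gt_len = gt[1] - gt[0]
    if pred_len <= 0 or gt_len <= 0:
        raise ValueError("segments must have positive duration")
    inter = overlap(pred, gt)
    coverage = inter / gt_len     # fraction of the true action that is covered
    precision = inter / pred_len  # fraction of the prediction that is action
    if coverage >= thr and precision >= thr:
        return "well-aligned"
    if precision >= thr:
        return "incomplete"       # mostly action, but misses part of it
    if coverage >= thr:
        return "over-complete"    # covers the action plus extra background
    return "misaligned"


if __name__ == "__main__":
    gt = (10.0, 20.0)
    for pred in [(12.0, 16.0), (5.0, 25.0), (10.0, 20.0)]:
        print(pred, classify(pred, gt), round(temporal_iou(pred, gt), 2))
```

Both error types can produce a similar temporal IoU, which is why the sketch separates coverage from precision rather than relying on IoU alone.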

Abstract

Recent breakthroughs in Multimodal Large Language Models (MLLMs) have gained significant recognition within the deep learning community, where the fusion of Video Foundation Models (VFMs) and Large Language Models (LLMs) has proven instrumental in constructing robust video understanding systems, effectively surmounting constraints associated with predefined visual tasks. These sophisticated MLLMs exhibit remarkable proficiency in comprehending videos, swiftly attaining unprecedented performance levels across diverse benchmarks. However, their operation demands substantial memory and computational resources, underscoring the continued importance of traditional models in video comprehension tasks. In this paper, we introduce a novel learning paradigm termed MLLM4WTAL. This paradigm harnesses the potential of MLLMs to offer temporal action key semantics and complete semantic priors for…

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis