MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization

Zhenying Fang; Richang Hong

arXiv:2511.13039·cs.CV·November 18, 2025

TL;DR

MGCA-Net is a multi-grained, category-aware network for open-vocabulary temporal action localization that aims to improve recognition accuracy for both base and novel action categories in videos.

Contribution

The paper proposes MGCA-Net, a network that combines a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier to address the single-granularity category recognition of prior open-vocabulary action localization methods.

Findings

01. Achieves state-of-the-art results on THUMOS'14 and ActivityNet-1.3.

02. Excels in zero-shot temporal action localization scenarios.

03. Enhances localization accuracy through multi-grained category awareness.
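
For context on what "localization accuracy" means here: temporal action localization benchmarks such as THUMOS'14 and ActivityNet-1.3 are conventionally scored by matching predicted segments to ground-truth segments using temporal intersection-over-union (tIoU). The sketch below is not taken from the paper; the function name `temporal_iou` and the (start, end) tuple layout are assumptions made for illustration only.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments on the same time axis.

    Hypothetical helper for illustration; not part of MGCA-Net.
    """
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union is the sum of both lengths minus the shared overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

A predicted segment typically counts as correct at a given threshold only if its tIoU with a ground-truth instance of the same category meets that threshold, so both the boundaries and the category label have to be right.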

Abstract

Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address these issues, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet…

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Action Observation and Synchronization