Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition

Baoli Sun; Yihan Wang; Xinzhu Ma; Zhihui Wang; Kun Lu; Zhiyong Wang

arXiv:2511.21202 · cs.CV · November 27, 2025

TL;DR

This paper introduces the Action-Region Tracking framework for fine-grained video action recognition, effectively capturing subtle local details over time using a novel query-response mechanism and semantic constraints.

Contribution

The work proposes a new framework that leverages semantic queries and contrastive learning to improve fine-grained action recognition by tracking local region dynamics.

Findings

01. Outperforms previous state-of-the-art methods on benchmark datasets.
02. Effectively captures subtle local action details over time.
03. Utilizes a novel semantic query mechanism for region response detection.

Abstract

Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action…
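The query-response idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the semantic queries are plain vectors and that the "region response" for a query in a frame is an attention-weighted sum of that frame's patch features. The function name `region_responses`, the tensor shapes, and the scaled dot-product scoring are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def region_responses(queries, frame_feats):
    """Hypothetical query-response step (an assumption, not the paper's code).

    queries:     (K, D) array, one vector per semantic query
    frame_feats: (T, N, D) array, N patch features for each of T frames
    Returns:
        responses: (T, K, D) one response per query per frame
        attn:      (T, K, N) attention of each query over the patches of each frame
    """
    d = queries.shape[-1]
    # Scaled dot-product similarity between every query and every patch, per frame.
    scores = np.einsum("kd,tnd->tkn", queries, frame_feats) / np.sqrt(d)
    # Normalize over the patches so each query yields a distribution over regions.
    attn = softmax(scores, axis=-1)
    # Attention-weighted sum of patch features gives the per-frame response.
    responses = np.einsum("tkn,tnd->tkd", attn, frame_feats)
    return responses, attn


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
T, N, K, D = 4, 16, 3, 8
queries = rng.normal(size=(K, D))
frame_feats = rng.normal(size=(T, N, D))
responses, attn = region_responses(queries, frame_feats)
print(responses.shape, attn.shape)
```

Reading `responses[:, k, :]` along the first axis gives the sequence of responses that query `k` produced over the frames, which is one simple way to follow a local region through time under these assumptions.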

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Human Motion and Animation