TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition

Imtiaz Ul Hassan; Nik Bessis; Ardhendu Behera

arXiv:2604.11498·cs.CV·April 14, 2026

TL;DR

TAG-Head is a lightweight, plug-and-play spatio-temporal graph head that improves RGB-only fine-grained action recognition by capturing long-range dependencies and stabilizing motion cues, setting a new state of the art on FineGym and HAA500.

Contribution

Introduces TAG-Head, a novel RGB-only module combining Transformer and graph components for improved fine-grained action recognition.

Findings

01

Sets new state-of-the-art on FineGym and HAA500 datasets.

02

Surpasses many multimodal approaches using only RGB data.

03

Maintains low latency and parameter overhead.

Abstract

Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ only by subtle spatio-temporal cues. Many recent systems improve discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph that combines (i) fully connected intra-frame edges, which resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges, which connect features at the same spatial location across frames to stabilise motion…
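
The two edge types described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the node ordering, and the choice to link every pair of frames at a given location (rather than only adjacent frames, which the abstract does not specify) are assumptions made for illustration. Edges are stored as directed pairs without self-loops.

```python
from itertools import product


def node_index(t, y, x, height, width):
    """Flatten a (frame, row, column) token position into one node id (hypothetical ordering)."""
    return (t * height + y) * width + x


def build_edges(num_frames, height, width):
    """Return (intra, temporal) sets of directed edges over a T x H x W token grid.

    Assumption: temporal edges link ALL frame pairs at the same location;
    the abstract does not say whether only adjacent frames are linked.
    """
    locations = list(product(range(height), range(width)))
    intra, temporal = set(), set()

    # (i) Fully connected intra-frame edges: every pair of distinct locations in one frame.
    for t in range(num_frames):
        for (y1, x1), (y2, x2) in product(locations, repeat=2):
            if (y1, x1) != (y2, x2):
                intra.add((node_index(t, y1, x1, height, width),
                           node_index(t, y2, x2, height, width)))

    # (ii) Time-aligned temporal edges: same spatial location, different frames.
    for y, x in locations:
        for t1, t2 in product(range(num_frames), repeat=2):
            if t1 != t2:
                temporal.add((node_index(t1, y, x, height, width),
                              node_index(t2, y, x, height, width)))

    return intra, temporal


# Example: 4 frames of a 2 x 2 token grid (16 nodes).
intra, temporal = build_edges(num_frames=4, height=2, width=2)
print(len(intra), len(temporal))
```

With T frames and S = H × W locations per frame, this yields T·S·(S−1) intra-frame edges and S·T·(T−1) temporal edges under the all-pairs assumption, so the temporal part stays sparse compared with a graph that is fully connected across space and time.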
