TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition
Imtiaz Ul Hassan, Nik Bessis, Ardhendu Behera

TL;DR
TAG-Head is a lightweight, plug-and-play spatio-temporal graph head that enhances RGB-only fine-grained action recognition by capturing long-range dependencies and stabilizing motion cues, achieving state-of-the-art results on FineGym and HAA500.
Contribution
Introduces TAG-Head, a novel RGB-only module combining Transformer and graph components for improved fine-grained action recognition.
Findings
Sets new state-of-the-art on FineGym and HAA500 datasets.
Surpasses many multimodal approaches using only RGB data.
Maintains low latency and parameter overhead.
Abstract
Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ by subtle spatio-temporal cues. Many recent systems enhance discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph with (i) fully connected intra-frame edges that resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges that connect features at the same spatial location across frames to stabilise motion…
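The two edge types described in the abstract can be illustrated with a small adjacency-mask sketch. This is not the authors' implementation: the function name `build_tag_adjacency`, the node ordering (index = t * H*W + s), and the choice to link a spatial location to the same location in every other frame (rather than only adjacent frames) are all assumptions made for illustration.

```python
import numpy as np

def build_tag_adjacency(num_frames, height, width):
    """Hypothetical helper: boolean adjacency masks over T*H*W token nodes.

    Assumed node ordering: index = t * (H*W) + s, where t is the frame
    index and s is the flattened spatial location.
    """
    spatial = height * width
    n = num_frames * spatial
    frame_id = np.arange(n) // spatial   # frame each node belongs to
    loc_id = np.arange(n) % spatial      # spatial location of each node

    same_frame = frame_id[:, None] == frame_id[None, :]
    same_loc = loc_id[:, None] == loc_id[None, :]

    # (i) fully connected intra-frame edges, excluding self-loops
    intra = same_frame & ~np.eye(n, dtype=bool)
    # (ii) time-aligned edges: same spatial location, different frame
    # (assumption: all frame pairs are linked, not only neighbours)
    temporal = same_loc & ~same_frame
    return intra, temporal

intra, temporal = build_tag_adjacency(num_frames=4, height=2, width=2)
```

With 4 frames of 2x2 tokens, each node has 3 intra-frame neighbours and 3 time-aligned neighbours, and the two masks never overlap, so a graph layer could treat them as separate relation types or merge them into one adjacency.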
