Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition
Zefeng Qian, Xincheng Yao, Yifei Huang, Chongyang Zhang, Jiangyong Ying, Hong Sun

TL;DR
This paper introduces a novel framework called Language-Guided Action Anatomy (LGA) that leverages large language models to dissect and understand human actions in videos for few-shot recognition, surpassing traditional label-based methods.
Contribution
LGA utilizes LLMs to decompose action labels into atomic descriptions and segments videos into phases, enabling more effective multimodal fusion for few-shot action recognition.
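The two anatomy steps named above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names `build_anatomy_prompt` and `split_into_phases`, the prompt wording, the default of three phases, and the uniform frame split are all hypothetical, and no real LLM is called.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_anatomy_prompt(label: str, num_phases: int = 3) -> str:
    """Build a hypothetical prompt asking an LLM to break an action label
    into ordered atomic descriptions. The wording is an assumption."""
    return (
        f"Describe the action '{label}' as {num_phases} ordered atomic steps. "
        "For each step, mention posture, motion, and any object interaction."
    )


def split_into_phases(frames: Sequence[T], num_phases: int = 3) -> List[List[T]]:
    """Split a frame sequence into contiguous phases of near-equal length.

    Uniform splitting is an assumption; the paper's segmentation may differ.
    """
    if num_phases <= 0:
        raise ValueError("num_phases must be positive")
    n = len(frames)
    # Integer boundaries so that every frame lands in exactly one phase.
    bounds = [(i * n) // num_phases for i in range(num_phases + 1)]
    return [list(frames[bounds[i]:bounds[i + 1]]) for i in range(num_phases)]


if __name__ == "__main__":
    print(build_anatomy_prompt("pouring water"))
    print(split_into_phases(list(range(8)), num_phases=3))
```

In a fuller pipeline, each phase's visual features would be matched with the corresponding atomic description before fusion; that part is omitted here because the summary does not describe it in detail.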
Findings
Achieves state-of-the-art results on multiple FSAR benchmarks.
Effectively captures spatiotemporal cues through atomic-level analysis.
Enhances generalization by integrating textual and visual atomic features.
Abstract
Few-shot action recognition (FSAR) aims to classify human actions in videos with only a small number of labeled samples per category. The scarcity of training data has driven recent efforts to incorporate additional modalities, particularly text. However, the subtle variations in human posture, motion dynamics, and object interactions that occur during different phases are critical inherent knowledge of actions that cannot be fully exploited by action labels alone. In this work, we propose Language-Guided Action Anatomy (LGA), a novel framework that goes beyond label semantics by leveraging Large Language Models (LLMs) to dissect the essential representational characteristics hidden beneath action labels. Guided by the prior knowledge encoded in the LLM, LGA effectively captures rich spatiotemporal cues in few-shot scenarios. Specifically, for text, we prompt an off-the-shelf LLM to…
