Spatio-temporal Decoupled Knowledge Compensator for Few-Shot Action Recognition
Hongyu Qu, Xiangbo Shu, Rui Yan, Hailiang Gao, Wenguan Wang, Jinhui Tang

TL;DR
This paper introduces DiST, a decomposition-incorporation framework that uses large language models to supply decoupled spatial and temporal knowledge about actions, and uses that knowledge to learn multi-granularity prototypes for few-shot action recognition.
Contribution
The paper proposes a novel framework that uses decoupled spatial and temporal knowledge from language models to enhance prototype learning in few-shot action recognition (FSAR).
Findings
Achieves state-of-the-art results on five FSAR datasets.
Effectively captures fine-grained spatial details.
Models diverse temporal patterns.
Abstract
Few-Shot Action Recognition (FSAR) is a challenging task that requires recognizing novel action categories with a few labeled videos. Recent works typically apply semantically coarse category names as auxiliary contexts to guide the learning of discriminative visual features. However, such context provided by the action names is too limited to provide sufficient background knowledge for capturing novel spatial and temporal concepts in actions. In this paper, we propose DiST, an innovative Decomposition-incorporation framework for FSAR that makes use of decoupled Spatial and Temporal knowledge provided by large language models to learn expressive multi-granularity prototypes. In the decomposition stage, we decouple vanilla action names into diverse spatio-temporal attribute descriptions (action-related knowledge). Such commonsense knowledge complements semantic contexts from spatial and…
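The abstract's idea of compensating visual prototypes with attribute descriptions can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`compensated_prototype`, `build_prototypes`, `classify`), the mixing weight `alpha`, the averaging of spatial and temporal scores, and the toy 3-dimensional vectors standing in for video features and text embeddings are all assumptions made for illustration. It only shows the general pattern of fusing a class's mean support feature with separate spatial (object) and temporal (state) attribute embeddings, then classifying a query by cosine similarity.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def l2_normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def compensated_prototype(support_feats, attribute_embs, alpha=0.5):
    """Blend the mean support feature with the mean attribute embedding.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    visual = l2_normalize(mean_vector(support_feats))
    textual = l2_normalize(mean_vector(attribute_embs))
    return [(1 - alpha) * v + alpha * t for v, t in zip(visual, textual)]

def build_prototypes(support, spatial_attrs, temporal_attrs, alpha=0.5):
    """Build one spatial and one temporal prototype per class."""
    return {
        name: {
            "spatial": compensated_prototype(feats, spatial_attrs[name], alpha),
            "temporal": compensated_prototype(feats, temporal_attrs[name], alpha),
        }
        for name, feats in support.items()
    }

def classify(query_feat, prototypes):
    """Pick the class whose two prototypes are, on average, closest to the query."""
    def score(name):
        p = prototypes[name]
        return 0.5 * (cosine(query_feat, p["spatial"]) + cosine(query_feat, p["temporal"]))
    return max(prototypes, key=score)

# Toy 2-way 2-shot episode; all vectors are made-up stand-ins for real features.
support = {
    "pour": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "cut": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}
spatial_attrs = {"pour": [[1.0, 0.0, 0.2]], "cut": [[0.0, 1.0, 0.2]]}
temporal_attrs = {"pour": [[0.8, 0.0, 0.5]], "cut": [[0.0, 0.8, 0.5]]}

prototypes = build_prototypes(support, spatial_attrs, temporal_attrs)
print(classify([0.95, 0.05, 0.1], prototypes))
```

Keeping the spatial and temporal prototypes separate until scoring mirrors the decoupling the abstract describes; a real system would replace the toy vectors with frame-level and video-level features and with embeddings of the LLM-generated descriptions.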
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
It's a creative attempt to calculate temporal/spatial prototypes with semantic features, emphasizing different aspects. This work achieves good performance on these datasets.
In the representation of temporal knowledge, there are some noun phrases that incorporate words derived from spatial knowledge. This makes the modeling of the temporal domain less pure, and perhaps it would be better to opt for simpler verbs only.
1. The idea of leveraging different types of text descriptions generated by LLMs is somewhat novel. 2. The performance of the proposed model is good and much better than that of previous models. 3. The authors also conduct a detailed analysis of the proposed model.
1. The proposed model conducts a decomposition according to the GPT outputs, i.e., generated texts about action-related objects and action states. I wonder whether there is some overlap between the object texts and the state texts for an action category. More importantly, how can the authors guarantee that action representations learned under the supervision of so-called action states encode temporal information? 2. Please provide some quantitative analysis to demonstrate that the model can exactly
1. The approach of leveraging spatial and temporal attributes to enhance few-shot action recognition is well-motivated. 2. Extensive experiments validate the effectiveness of DIST and demonstrate the contribution of each component.
1. Some experimental settings and analyses are unclear and could be confusing. Refer to the "Questions" section for details. 2. Some previous works have also utilized LLMs to generate attributes or concepts for action recognition or action representation learning [1,2]. The authors should discuss the key differences in how DIST leverages the generated attributes compared to these works and explain why DIST is better in FSAR. [1] OST: Refining Text Knowledge with Optimal Spatio-Temporal Descript
1. Comprehensive experimental setup. 2. Outstanding experimental results, achieving significant improvements on various datasets. 3. Clear and aesthetically pleasing images and tables.
**1. The method is not sufficiently reasonable.** The authors propose to utilize information provided by large models after decoupling space and time, but the method of providing this information is too naive. 1.1 The authors do not consider whether irrelevant objects will introduce noise when providing spatial information. The method would be more reasonable if an object detection model could be used to check whether the supplementary object information exists. The authors should conduct a n
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
