Storyboard guided Alignment for Fine-grained Video Action Recognition
Enqi Liu, Liyuan Pan, Yan Yang, Yiran Zhong, Zhijing Wu, Xinxiao Wu, Liu Liu

TL;DR
This paper introduces a multi-granularity framework for fine-grained video action recognition that leverages storyboarding-inspired fine-grained descriptions and key frame aggregation to improve video-text alignment accuracy.
Contribution
The paper combines global video semantics with fine-grained descriptions generated by a pre-trained large language model to improve video-text matching.
Findings
Superior performance in supervised, few-shot, and zero-shot settings
Effective identification of key frames for embedding aggregation
Enhanced alignment accuracy in fine-grained action recognition
Abstract
Fine-grained video action recognition can be conceptualized as a video-text matching problem. Previous approaches often rely on global video semantics to consolidate video embeddings, which can lead to misalignment in video-text pairs due to a lack of understanding of action semantics at an atomic granularity level. To tackle this challenge, we propose a multi-granularity framework based on two observations: (i) videos with different global semantics may share similar atomic actions or appearances, and (ii) atomic actions within a video can be momentary, slow, or even non-directly related to the global video semantics. Inspired by the concept of storyboarding, which disassembles a script into individual shots, we enhance global video semantics by generating fine-grained descriptions using a pre-trained large language model. These detailed descriptions capture common atomic actions…
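To make the video-text matching idea concrete, here is a minimal sketch of how key frames might be weighted by their similarity to fine-grained description embeddings before being matched against a class's global text embedding. This is an illustration under stated assumptions, not the authors' implementation: the function names (`keyframe_aggregate`, `classify`), the max-over-descriptions frame score, the softmax weighting, and the `temperature` value are all hypothetical choices, and the embeddings are assumed to come from some pre-trained video and text encoders that are not shown.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def keyframe_aggregate(frame_emb, desc_emb, temperature=0.1):
    """Aggregate frame embeddings, weighting frames that match a description.

    frame_emb: (T, D) per-frame embeddings (assumed precomputed).
    desc_emb:  (K, D) embeddings of fine-grained descriptions (assumed precomputed).
    Returns the unit-length video embedding (D,) and the frame weights (T,).
    """
    f = l2_normalize(frame_emb)
    d = l2_normalize(desc_emb)
    sim = f @ d.T                      # (T, K) cosine similarities
    frame_score = sim.max(axis=1)      # best-matching description per frame
    weights = softmax(frame_score / temperature)
    video = l2_normalize(weights @ f)  # weighted average of frames
    return video, weights


def classify(frame_emb, class_global_emb, class_desc_emb):
    """Score each class by cosine similarity between the description-guided
    video embedding and that class's global text embedding."""
    scores = []
    for g, d in zip(class_global_emb, class_desc_emb):
        video, _ = keyframe_aggregate(frame_emb, d)
        scores.append(float(video @ l2_normalize(g)))
    return int(np.argmax(scores)), scores


# Toy usage with hand-made 4-dimensional embeddings (hypothetical data).
frames = np.eye(4)[:3]                          # three frames: e0, e1, e2
class_globals = [np.eye(4)[0], np.eye(4)[3]]    # class 0 ~ e0, class 1 ~ e3
class_descs = [np.eye(4)[[0]], np.eye(4)[[3]]]  # one description per class
pred, scores = classify(frames, class_globals, class_descs)
_, w = keyframe_aggregate(frames, class_descs[0])
```

In this toy setup only the first frame matches class 0's description, so it receives most of the aggregation weight, while class 1's description matches no frame and the frames are averaged uniformly.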
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Video Analysis and Summarization
