Storyboard guided Alignment for Fine-grained Video Action Recognition

Enqi Liu; Liyuan Pan; Yan Yang; Yiran Zhong; Zhijing Wu; Xinxiao Wu; Liu Liu

arXiv:2410.14238·cs.CV·October 21, 2024

Open Access

TL;DR

This paper introduces a multi-granularity framework for fine-grained video action recognition that leverages storyboarding-inspired fine-grained descriptions and key frame aggregation to improve video-text alignment accuracy.

Contribution

The paper proposes an approach that combines global video semantics with fine-grained descriptions generated by a large language model to improve video-text matching.

Findings

01. Superior performance in supervised, few-shot, and zero-shot settings

02. Effective identification of key frames for embedding aggregation

03. Enhanced alignment accuracy in fine-grained action recognition

Abstract

Fine-grained video action recognition can be conceptualized as a video-text matching problem. Previous approaches often rely on global video semantics to consolidate video embeddings, which can lead to misalignment in video-text pairs due to a lack of understanding of action semantics at an atomic granularity level. To tackle this challenge, we propose a multi-granularity framework based on two observations: (i) videos with different global semantics may share similar atomic actions or appearances, and (ii) atomic actions within a video can be momentary, slow, or even non-directly related to the global video semantics. Inspired by the concept of storyboarding, which disassembles a script into individual shots, we enhance global video semantics by generating fine-grained descriptions using a pre-trained large language model. These detailed descriptions capture common atomic actions…
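The matching idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`aggregate_key_frames`, `classify`), the softmax temperature, the max-over-descriptions frame scoring, and the toy 4-dimensional embeddings are all assumptions made for illustration. It only shows how per-frame similarity to fine-grained descriptions could weight frames before the pooled video embedding is compared with a class-level text embedding.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero length.
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def softmax(scores, temperature):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_key_frames(frame_embs, description_embs, temperature=0.1):
    """Weight frames by their best match to any fine-grained description.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    a frame's score is its maximum cosine similarity over the descriptions.
    """
    scores = [max(cosine(f, d) for d in description_embs) for f in frame_embs]
    weights = softmax(scores, temperature)
    dim = len(frame_embs[0])
    video = [sum(w * f[i] for w, f in zip(weights, frame_embs)) for i in range(dim)]
    return video, weights

def classify(frame_embs, classes, temperature=0.1):
    """Score each class by cosine(aggregated video, global class text)."""
    scores = {}
    for name, text in classes.items():
        video, _ = aggregate_key_frames(frame_embs, text["descriptions"], temperature)
        scores[name] = cosine(video, text["global"])
    return max(scores, key=scores.get), scores

# Toy data: two frames show atomic actions, the third is unrelated background.
frames = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
classes = {
    # Both classes share the first atomic action, as in observation (i).
    "pour water": {"global": [1, 1, 0, 0],
                   "descriptions": [[1, 0, 0, 0], [0, 1, 0, 0]]},
    "wipe table": {"global": [1, 0, 0, 1],
                   "descriptions": [[1, 0, 0, 0], [0, 0, 0, 1]]},
}
label, class_scores = classify(frames, classes)
```

In this toy setup the background frame receives a near-zero weight, so the pooled embedding for "pour water" aligns more closely with its global text embedding than a uniform average of all three frames would. In practice the embeddings would come from pre-trained video and text encoders, which this sketch leaves out.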

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Video Analysis and Summarization