Kronecker Mask and Interpretive Prompts are Language-Action Video Learners

Jingyi Yang; Zitong Yu; Xiuming Ni; Jia He; Hui Li

arXiv:2502.03549·cs.CV·February 11, 2025

TL;DR

This paper introduces CLAVER, which adapts CLIP to video action recognition by adding a Kronecker mask for temporal modeling and interpretive prompts for verb understanding. The authors report that it outperforms existing methods on multiple benchmarks.

Contribution

The paper proposes CLAVER, combining a Kronecker mask for temporal modeling and interpretive prompts from large language models to improve video action recognition with CLIP.

Findings

01. Outperforms existing methods on multiple benchmarks.

02. Effectively models spatiotemporal dynamics in videos.

03. Enhances verb comprehension in action recognition.

Abstract

Contrastive language-image pretraining (CLIP) has significantly advanced image-based vision learning. A pressing topic subsequently arises: how can we effectively adapt CLIP to the video domain? Recent studies have focused on adjusting either the textual or visual branch of CLIP for action recognition. However, we argue that adaptations of both branches are crucial. In this paper, we propose CLAVER: a Contrastive Language-Action Video Learner, designed to shift CLIP's focus from the alignment of static visual objects and concrete nouns to the alignment of dynamic action behaviors and abstract verbs. Specifically, we introduce a novel Kronecker mask attention for temporal modeling. Our tailored Kronecker mask offers three benefits: 1) it expands the temporal receptive field for each token, 2) it serves as an effective spatiotemporal…
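
The abstract is cut off before it defines the Kronecker mask, so the following is only a generic sketch of how a Kronecker product can combine a temporal pattern and a spatial pattern into one attention mask over video tokens. It does not reproduce the paper's specific mask. The helper names (`kron_mask`, `masked_attention`), the frame-major token ordering, and the three example patterns are assumptions made for illustration.

```python
import numpy as np

def kron_mask(temporal, spatial):
    """Combine a (T, T) temporal pattern and an (N, N) spatial pattern into a
    (T*N, T*N) boolean mask. Tokens are assumed frame-major: index = t * N + n."""
    return np.kron(temporal, spatial).astype(bool)

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; mask[i, j] == True lets token i attend to token j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)             # block disallowed pairs
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

T, N, D = 3, 4, 8  # frames, patches per frame, feature dim (illustrative sizes)

# Illustrative patterns only; the paper's own mask is not specified in this excerpt.
spatial_only = kron_mask(np.eye(T), np.ones((N, N)))       # attend within the same frame
temporal_pipeline = kron_mask(np.ones((T, T)), np.eye(N))  # same patch position, all frames
joint = kron_mask(np.ones((T, T)), np.ones((N, N)))        # every token sees every token

rng = np.random.default_rng(0)
x = rng.standard_normal((T * N, D))
out = masked_attention(x, x, x, temporal_pipeline)
```

Swapping the temporal or spatial factor changes which token pairs may interact without touching the attention code itself, which is the convenience a Kronecker-structured mask offers.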

Code & Models

Repositories

yjyddq/CLAVER
PyTorch · Official

Taxonomy

Topics: Linguistic Education and Pedagogy

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Focus