Kronecker Mask and Interpretive Prompts are Language-Action Video Learners
Jingyi Yang, Zitong Yu, Xiuming Ni, Jia He, Hui Li

TL;DR
This paper introduces CLAVER, which adapts CLIP to video action recognition using a Kronecker mask for temporal modeling and interpretive prompts for verb understanding, and reports outperforming existing methods on multiple benchmarks.
Contribution
The paper proposes CLAVER, combining a Kronecker mask for temporal modeling and interpretive prompts from large language models to improve video action recognition with CLIP.
Findings
Outperforms existing methods on multiple benchmarks.
Effectively models spatiotemporal dynamics in videos.
Enhances verb comprehension in action recognition.
Abstract
Contrastive language-image pretraining (CLIP) has significantly advanced image-based vision learning. A pressing topic subsequently arises: how can we effectively adapt CLIP to the video domain? Recent studies have focused on adjusting either the textual or visual branch of CLIP for action recognition. However, we argue that adaptations of both branches are crucial. In this paper, we propose CLAVER: a Contrastive Language-Action Video Learner, designed to shift CLIP's focus from the alignment of static visual objects and concrete nouns to the alignment of dynamic action behaviors and abstract verbs. Specifically, we introduce a novel Kronecker mask attention for temporal modeling. Our tailored Kronecker mask offers three benefits: 1) it expands the temporal receptive field for each token, 2) it serves as an effective spatiotemporal…
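The abstract is truncated before it defines the mask, so the following is only a minimal sketch of the general idea of building an attention mask as a Kronecker product of a temporal pattern and a spatial pattern. The specific patterns, the frame-major token layout, the sizes `T`, `N`, `D`, and the helper names `kronecker_mask` and `masked_attention` are all assumptions made for illustration; they are not taken from the paper and may differ from CLAVER's actual mask.

```python
import numpy as np

def kronecker_mask(temporal, spatial):
    """Build a boolean (T*N, T*N) mask from a (T, T) and an (N, N) pattern.

    Assumes tokens are flattened frame-major: index = t * N + n.
    """
    return np.kron(temporal, spatial).astype(bool)

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; mask[i, j] == False blocks token i from j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

T, N, D = 3, 4, 8  # hypothetical: frames, patches per frame, feature width

# Narrow pattern: a token sees only the same patch position in every frame.
narrow = kronecker_mask(np.ones((T, T)), np.eye(N))

# Wider pattern (an assumption, not the paper's definition): a token sees
# every patch in every *other* frame, plus itself.
wide = kronecker_mask(np.ones((T, T)) - np.eye(T), np.ones((N, N)))
wide = wide | np.eye(T * N, dtype=bool)

rng = np.random.default_rng(0)
x = rng.standard_normal((T * N, D))
out, weights = masked_attention(x, x, x, wide)

# Number of positions the first token may attend to under each mask.
print(narrow.sum(axis=1)[0], wide.sum(axis=1)[0])  # prints: 3 9
```

Under these assumptions, the wider pattern raises each token's reachable positions from T to (T - 1) * N + 1, which is one way to read "expands the temporal receptive field for each token".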
Taxonomy
Topics: Linguistic Education and Pedagogy
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Focus
