Enhancing Video Transformers for Action Understanding with VLM-aided Training
Hui Lu, Hu Jian, Ronald Poppe, Albert Ali Salah

TL;DR
This paper introduces the Four-tiered Prompts (FTP) framework, which enhances video transformer models by aligning their visual encodings with the outputs of Visual Language Models (VLMs) during training, improving action understanding and reaching state-of-the-art accuracy.
Contribution
The paper proposes the Four-tiered Prompts (FTP) framework, which combines Vision Transformers (ViTs) with Visual Language Models (VLMs) during training to improve generalization in video action recognition without increasing inference cost.
Findings
Achieved 93.8% top-1 accuracy on Kinetics-400
Surpassed VideoMAEv2 by 2.8% on Kinetics-400
Achieved 83.4% accuracy on Something-Something V2
Abstract
Owing to their ability to extract relevant spatio-temporal video embeddings, Vision Transformers (ViTs) are currently the best performing models in video action understanding. However, their generalization over domains or datasets is somewhat limited. In contrast, Visual Language Models (VLMs) have demonstrated exceptional generalization performance, but are currently unable to process videos. Consequently, they cannot extract spatio-temporal patterns that are crucial for action understanding. In this paper, we propose the Four-tiered Prompts (FTP) framework that takes advantage of the complementary strengths of ViTs and VLMs. We retain ViTs' strong spatio-temporal representation ability but improve the visual encodings to be more comprehensive and general by aligning them with VLM outputs. The FTP framework adds four feature processors that focus on specific aspects of human action in…
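The alignment idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state which loss is used (the text is truncated), so everything below is an assumption made for illustration: the cosine-based alignment term, the weighting factor `weight`, and the function names `cosine_similarity`, `alignment_loss`, and `training_loss` are all hypothetical and are not taken from the paper. The sketch only shows how four per-tier video features could be pulled toward four matching VLM embeddings during training.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def alignment_loss(video_feats, vlm_feats):
    """Mean (1 - cosine similarity) over paired tiers.

    ASSUMPTION: one video feature and one VLM embedding per tier;
    the loss form is illustrative, not the paper's.
    """
    if len(video_feats) != len(vlm_feats):
        raise ValueError("expected one VLM embedding per video feature")
    per_tier = [1.0 - cosine_similarity(v, t)
                for v, t in zip(video_feats, vlm_feats)]
    return sum(per_tier) / len(per_tier)


def training_loss(classification_loss, video_feats, vlm_feats, weight=0.5):
    """Classification loss plus a weighted alignment term.

    ASSUMPTION: `weight` is a hypothetical hyperparameter.
    """
    return classification_loss + weight * alignment_loss(video_feats, vlm_feats)


# Toy example: four tiers, each a 3-dimensional feature.
video = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
vlm = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
total = training_loss(classification_loss=1.0, video_feats=video, vlm_feats=vlm)
```

In a sketch like this the VLM embeddings appear only in the training loss, so prediction would need only the video branch, which is consistent with the summary's statement that inference cost does not increase.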
Taxonomy
Topics: Neural Networks and Applications · Industrial Vision Systems and Defect Detection · Currency Recognition and Detection
Methods: Focus
