Enhancing Video Transformers for Action Understanding with VLM-aided Training

Hui Lu; Hu Jian; Ronald Poppe; Albert Ali Salah

arXiv:2403.16128·cs.CV·March 26, 2024·1 cites

TL;DR

This paper introduces the Four-tiered Prompts (FTP) framework, which enhances video transformer models by aligning their visual encodings with the outputs of Visual Language Models (VLMs) during training, leading to improved action understanding and state-of-the-art accuracy.

Contribution

The paper proposes the Four-tiered Prompts framework that combines ViTs and VLMs, improving generalization in video action recognition without increasing inference costs.

Findings

01 Achieved 93.8% top-1 accuracy on Kinetics-400

02 Surpassed VideoMAEv2 by 2.8% on Kinetics-400

03 Achieved 83.4% accuracy on Something-Something V2

Abstract

Owing to their ability to extract relevant spatio-temporal video embeddings, Vision Transformers (ViTs) are currently the best performing models in video action understanding. However, their generalization over domains or datasets is somewhat limited. In contrast, Visual Language Models (VLMs) have demonstrated exceptional generalization performance, but are currently unable to process videos. Consequently, they cannot extract spatio-temporal patterns that are crucial for action understanding. In this paper, we propose the Four-tiered Prompts (FTP) framework that takes advantage of the complementary strengths of ViTs and VLMs. We retain ViTs' strong spatio-temporal representation ability but improve the visual encodings to be more comprehensive and general by aligning them with VLM outputs. The FTP framework adds four feature processors that focus on specific aspects of human action in…
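The alignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state which loss FTP uses, so the cosine-based objective below is an assumption, and the names `cosine_similarity`, `alignment_loss`, and the toy embeddings are hypothetical. It only shows one common way to pull a video encoder's per-processor embeddings toward the matching VLM embeddings during training.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def alignment_loss(video_embeddings, vlm_embeddings):
    """Mean of (1 - cosine similarity) over paired embeddings.

    Assumption: one video-side embedding per feature processor, paired
    with one VLM-side embedding for the same aspect of the action.
    """
    if len(video_embeddings) != len(vlm_embeddings):
        raise ValueError("expected one VLM embedding per video embedding")
    if not video_embeddings:
        raise ValueError("at least one embedding pair is required")
    losses = [
        1.0 - cosine_similarity(v, t)
        for v, t in zip(video_embeddings, vlm_embeddings)
    ]
    return sum(losses) / len(losses)


# Toy example: four hypothetical processor outputs and their VLM targets.
video_side = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
vlm_side = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
loss = alignment_loss(video_side, vlm_side)
```

In this sketch the VLM-side embeddings act only as training targets, so nothing from the VLM would need to be evaluated at inference time, which is consistent with the contribution stated above.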

Taxonomy

Topics: Neural Networks and Applications · Industrial Vision Systems and Defect Detection · Currency Recognition and Detection

Methods: Focus