ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition

Soumyabrata Chaudhuri; Saumik Bhattacharya

arXiv:2308.03908 · cs.CV · August 9, 2023

TL;DR

This paper introduces ViLP, a novel multi-modal framework combining vision, language, and pose embeddings for improved video action recognition, achieving high accuracy without extensive pre-training.

Contribution

The paper presents the first pose-augmented vision-language model for video action recognition, integrating pose, visual, and text modalities for enhanced performance.
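
As a rough illustration of what integrating the three modalities can look like, here is a minimal sketch of late fusion followed by text-prompt matching. It is not the paper's architecture: the weighted-sum fusion, the `pose_weight` parameter, the function names, and the toy 3-dimensional embeddings are all assumptions made for illustration. It only shows the general idea of scoring a fused vision-and-pose embedding against per-class text embeddings by cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(vision, pose, pose_weight=0.5):
    """Hypothetical late fusion: weighted sum of the unit-length embeddings."""
    v = l2_normalize(vision)
    p = l2_normalize(pose)
    mixed = [(1 - pose_weight) * a + pose_weight * b for a, b in zip(v, p)]
    return l2_normalize(mixed)


def classify(vision, pose, text_embeddings, pose_weight=0.5):
    """Return the label whose text embedding is most similar to the fused video embedding."""
    video = fuse(vision, pose, pose_weight)
    scores = {
        label: sum(a * b for a, b in zip(video, l2_normalize(text)))
        for label, text in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings, invented for this example only.
text_embeddings = {"jump": [1.0, 0.0, 0.0], "run": [0.0, 1.0, 0.0]}
vision_embedding = [0.9, 0.2, 0.1]
pose_embedding = [0.8, 0.1, 0.3]

label, scores = classify(vision_embedding, pose_embedding, text_embeddings)
print(label, scores)
```

In a real system the three embeddings would come from trained encoders and the fusion would typically be learned rather than fixed; the fixed weight here just keeps the sketch self-contained.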

Findings

01. Achieves 92.81% accuracy on UCF-101 without pre-training.

02. Achieves 73.02% accuracy on HMDB-51 without pre-training.

03. Improves to 96.11% on UCF-101 and 75.75% on HMDB-51 after Kinetics pre-training.
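
The gain from Kinetics pre-training can be read off the figures above with simple arithmetic. The short sketch below does that, assuming the two post-pre-training accuracies are listed in the same dataset order as the first two findings (UCF-101, then HMDB-51). The dictionary layout and the function name are illustrative choices.

```python
# Accuracies (%) as reported in the findings above.
results = {
    "UCF-101": {"no_pretraining": 92.81, "kinetics_pretraining": 96.11},
    "HMDB-51": {"no_pretraining": 73.02, "kinetics_pretraining": 75.75},
}


def pretraining_gain(scores):
    """Absolute accuracy gain in percentage points, rounded to two decimals."""
    return round(scores["kinetics_pretraining"] - scores["no_pretraining"], 2)


gains = {dataset: pretraining_gain(scores) for dataset, scores in results.items()}
for dataset, gain in gains.items():
    print(f"{dataset}: +{gain:.2f} percentage points")
```

Under that assumption, the improvement is 3.30 percentage points on UCF-101 and 2.73 on HMDB-51.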

Abstract

Video Action Recognition (VAR) is a challenging task due to its inherent complexities. Although many approaches have been explored in the literature, designing a unified framework that recognizes a large number of human actions remains difficult. Recently, Multi-Modal Learning (MML) has demonstrated promising results in this domain. In the literature, the 2D skeleton or pose modality has often been used for this task, either independently or in conjunction with the visual information (RGB modality) present in videos. However, the combination of pose, visual information, and text attributes has not yet been explored, even though text and pose attributes have independently proven effective in numerous computer vision tasks. In this paper, we present the first pose-augmented Vision-Language Model (VLM) for VAR. Notably, our scheme achieves an accuracy of 92.81% and 73.02% on two…

Code & Models

Repositories

Soumyabrata2003/ViLP (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Multimodal Machine Learning Applications