ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition
Soumyabrata Chaudhuri, Saumik Bhattacharya

TL;DR
This paper introduces ViLP, a novel multi-modal framework combining vision, language, and pose embeddings for improved video action recognition, achieving high accuracy without extensive pre-training.
Contribution
The paper presents the first pose-augmented vision-language model for video action recognition, integrating pose, visual, and text modalities for enhanced performance.
Findings
Achieves 92.81% accuracy on UCF-101 without pre-training.
Achieves 73.02% accuracy on HMDB-51 without pre-training.
Improves to 96.11% (UCF-101) and 75.75% (HMDB-51) accuracy after Kinetics pre-training.
Abstract
Video Action Recognition (VAR) is a challenging task due to its inherent complexities. Though different approaches have been explored in the literature, designing a unified framework to recognize a large number of human actions remains an open problem. Recently, Multi-Modal Learning (MML) has demonstrated promising results in this domain. In the literature, the 2D skeleton or pose modality has often been used for this task, either independently or in conjunction with the visual information (RGB modality) present in videos. However, the combination of pose, visual information, and text attributes has not yet been explored, though text and pose attributes have each proven effective in numerous computer vision tasks. In this paper, we present the first pose-augmented Vision-Language Model (VLM) for VAR. Notably, our scheme achieves an accuracy of 92.81% and 73.02% on two…
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Multimodal Machine Learning Applications
