Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living
Dominick Reilly, Srijan Das

TL;DR
This paper introduces PI-ViT, a video transformer that uses auxiliary modules to incorporate 2D and 3D human pose information, improving recognition of Activities of Daily Living (ADL) and achieving state-of-the-art results.
Contribution
The paper presents the first pose-augmented video transformer for ADL, using auxiliary pose modules during training that are discarded during inference, enhancing recognition accuracy.
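The train-with-auxiliary-modules, infer-without-them pattern described above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the class `PoseInducedModel`, its methods, the mean-squared-error stand-in losses, and the `aux_weight` parameter are all hypothetical names and choices made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error; a placeholder for whatever losses the paper uses."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


@dataclass
class PoseInducedModel:
    """Toy stand-in: an RGB backbone plus training-only auxiliary pose heads."""
    backbone: Callable[[Vector], Vector]      # RGB features -> representation
    classifier: Callable[[Vector], Vector]    # representation -> class scores
    aux_heads: List[Callable[[Vector], Vector]] = field(default_factory=list)

    def training_loss(self, rgb: Vector, label: Vector,
                      pose_targets: Sequence[Vector],
                      aux_weight: float = 1.0) -> float:
        """Task loss plus one auxiliary term per pose head (assumed form)."""
        rep = self.backbone(rgb)
        loss = mse(self.classifier(rep), label)
        # Pose targets are consumed only here, i.e. only during training.
        for head, target in zip(self.aux_heads, pose_targets):
            loss += aux_weight * mse(head(rep), target)
        return loss

    def strip_auxiliary(self) -> None:
        """Discard the auxiliary heads before deployment."""
        self.aux_heads = []

    def predict(self, rgb: Vector) -> int:
        """Inference path: RGB only, no pose input, no auxiliary heads."""
        scores = self.classifier(self.backbone(rgb))
        return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy components, chosen only to make the sketch runnable.
model = PoseInducedModel(
    backbone=lambda x: [2.0 * v for v in x],
    classifier=lambda r: [r[0], r[1]],
    aux_heads=[lambda r: list(r)],
)
rgb_clip = [1.0, 3.0]
loss = model.training_loss(rgb_clip, [0.0, 1.0], pose_targets=[[2.0, 6.0]])
pred_before = model.predict(rgb_clip)
model.strip_auxiliary()
pred_after = model.predict(rgb_clip)
```

In this sketch, `predict` never touches `aux_heads`, so removing them changes neither the prediction nor the inference cost. This mirrors the claim that the pose modules matter only while the backbone's representation is being learned.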
Findings
Achieves state-of-the-art performance on three ADL datasets.
Operates without pose data or extra computational cost during inference.
Effectively distinguishes similar actions across multiple viewpoints.
Abstract
Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions, or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are two plug-in modules, 2D Skeleton Induction Module and 3D Skeleton…
Taxonomy
Topics: Human Pose and Action Recognition · Stroke Rehabilitation and Recovery · Hand Gesture Recognition Systems
