Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition
Nicholas Babey, Tiffany Gu, Yiheng Li, Cristian Meo, Kevin Zhu

TL;DR
This paper introduces a model that combines predictive world dynamics with explicit 3D human pose data to improve action recognition in complex, heavily occluded scenes, emphasizing the importance of spatial understanding.
Contribution
The paper presents a new architecture that fuses V-JEPA 2's predictive world dynamics with CoMotion's explicit, occlusion-tolerant human pose data for more robust action recognition.
Findings
Outperforms three baseline models on the InHARD (general action recognition) and UCF-19-Y-OCC (high-occlusion action recognition) benchmarks.
Excels in complex scenes with occlusions.
Highlights the importance of spatial grounding over pattern recognition.
Abstract
For embodied agents to effectively understand and interact within the world around them, they require a nuanced comprehension of human actions grounded in physical space. Current action recognition models, often relying on RGB video, learn superficial correlations between patterns and action labels, so they struggle to capture underlying physical interaction dynamics and human poses in complex scenes. We propose a model architecture that grounds action recognition in physical space by fusing two powerful, complementary representations: V-JEPA 2's contextual, predictive world dynamics and CoMotion's explicit, occlusion-tolerant human pose data. Our model is validated on both the InHARD and UCF-19-Y-OCC benchmarks for general action recognition and high-occlusion action recognition, respectively. Our model outperforms three other baselines, especially within complex, occlusive scenes. Our…
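The abstract does not say how the two representations are fused, so the following is only a minimal sketch of one common option, late fusion by concatenation followed by a linear classifier. It is not the authors' architecture. The function name, the array shapes, and the random stand-in features are all hypothetical; no V-JEPA 2 or CoMotion API is called.

```python
import numpy as np

def fuse_and_classify(video_emb, pose_seq, weights, bias):
    """Illustrative late fusion: concatenate a clip embedding with pooled 3D poses.

    video_emb: (D,) clip-level feature (stand-in for a video encoder output)
    pose_seq:  (T, J, 3) 3D joint positions over T frames (stand-in for a pose tracker output)
    weights:   (C, D + J*3) linear classifier weights; bias: (C,)
    Returns a (C,) vector of class probabilities.
    """
    # Average joints over time, then flatten to a single pose vector of length J*3.
    pose_vec = pose_seq.mean(axis=0).reshape(-1)
    # Late fusion: place the two representations side by side.
    fused = np.concatenate([video_emb, pose_vec])
    logits = weights @ fused + bias
    # Numerically stable softmax over action classes.
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical sizes chosen only to keep the example small.
D, T, J, C = 8, 6, 4, 5
rng = np.random.default_rng(0)
video_emb = rng.normal(size=D)
pose_seq = rng.normal(size=(T, J, 3))
weights = rng.normal(size=(C, D + J * 3))
bias = np.zeros(C)

probs = fuse_and_classify(video_emb, pose_seq, weights, bias)
predicted_class = int(probs.argmax())
```

Concatenation is the simplest way to let a classifier see both scene context and body configuration at once. Attention-based or gated fusion are alternatives, and the excerpt gives no indication of which approach the paper uses.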
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Multimodal Machine Learning Applications
