DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
Thomas Kreutz, Max Mühlhäuser, Alejandro Sanchez Guinea

TL;DR
DeSPITE is a novel multi-modal embedding model that learns joint representations of LiDAR point clouds, skeletons, IMU data, and text to enhance human activity understanding tasks like recognition and retrieval.
Contribution
This work introduces DeSPITE, a new deep model that jointly embeds four modalities for improved point cloud human activity analysis and pre-training.
Findings
Effective in Skeleton<->Pointcloud<->IMU matching and retrieval
Enhances pre-training for point cloud human activity recognition
Demonstrates superior performance on multiple datasets
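To make the matching and retrieval finding concrete, here is a minimal sketch of cross-modal retrieval in a shared embedding space. It assumes each modality has already been encoded into fixed-size vectors; the function name `retrieve_top_k`, the 16-dimensional embeddings, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def retrieve_top_k(query, gallery, k=1):
    """Return indices of the k gallery embeddings most similar to the query.

    Similarity is cosine similarity, a common (assumed) choice for
    contrastively trained embedding spaces.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                      # cosine similarity to every gallery item
    return np.argsort(-scores)[:k]      # highest similarity first


# Synthetic stand-ins for encoder outputs (hypothetical, illustration only).
rng = np.random.default_rng(0)
skeleton_gallery = rng.normal(size=(5, 16))                      # 5 skeleton clips
imu_query = skeleton_gallery[3] + 0.05 * rng.normal(size=16)     # IMU clip matching item 3

top = retrieve_top_k(imu_query, skeleton_gallery, k=2)
print(top)
```

The same function works in any direction (for example point cloud to skeleton), since all modalities are assumed to live in one space.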
Abstract
Despite LiDAR (Light Detection and Ranging) being an effective privacy-preserving alternative to RGB cameras to perceive human activities, it remains largely underexplored in the context of multi-modal contrastive pre-training for human activity understanding (e.g., human activity recognition (HAR), retrieval, or person re-identification (RE-ID)). To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a Deep Skeleton-Pointcloud-IMU-Text Embedding model, which effectively learns a joint embedding space across these four modalities. At the heart of our empirical exploration, we have combined the existing LIPD and Babel datasets, which enabled us to synchronize data of all four modalities, allowing us to explore the learning of a new joint…
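The abstract does not state the training objective in the portion shown here. As an illustration of how a joint embedding space over several modalities is commonly learned, the sketch below applies a CLIP-style symmetric InfoNCE loss to every pair of modalities. The loss choice, the temperature of 0.07, the function names, and the synthetic embeddings are assumptions for illustration, not the authors' confirmed method.

```python
import itertools

import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches whose rows are synchronized pairs."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))                 # row i of `a` matches row i of `b`
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)


def joint_contrastive_loss(embeddings, temperature=0.07):
    """Average the pairwise loss over all modality pairs (an assumed scheme)."""
    pairs = itertools.combinations(sorted(embeddings), 2)
    losses = [symmetric_info_nce(embeddings[m], embeddings[n], temperature)
              for m, n in pairs]
    return float(np.mean(losses))


# Synthetic stand-ins for encoder outputs (hypothetical, illustration only).
modalities = ("pointcloud", "skeleton", "imu", "text")
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
aligned = {m: shared + 0.01 * rng.normal(size=shared.shape) for m in modalities}
unaligned = {m: rng.normal(size=(8, 16)) for m in modalities}

aligned_loss = joint_contrastive_loss(aligned)
unaligned_loss = joint_contrastive_loss(unaligned)
print(aligned_loss, unaligned_loss)
```

Embeddings that agree across modalities for the same time window yield a lower loss than unrelated embeddings, which is the property a contrastive objective of this kind rewards.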
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation
