DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding

Thomas Kreutz; Max Mühlhäuser; Alejandro Sanchez Guinea

arXiv:2506.13897 · cs.CV · June 27, 2025

TL;DR

DeSPITE is a novel multi-modal embedding model that learns joint representations of LiDAR point clouds, skeletons, IMU data, and text to enhance human activity understanding tasks like recognition and retrieval.

Contribution

This work introduces DeSPITE, a new deep model that jointly embeds four modalities for improved point cloud human activity analysis and pre-training.

Findings

1. Effective in Skeleton<->Pointcloud<->IMU matching and retrieval

2. Enhances pre-training for point cloud human activity recognition

3. Demonstrates superior performance on multiple datasets

Abstract

Despite LiDAR (Light Detection and Ranging) being an effective privacy-preserving alternative to RGB cameras to perceive human activities, it remains largely underexplored in the context of multi-modal contrastive pre-training for human activity understanding (e.g., human activity recognition (HAR), retrieval, or person re-identification (RE-ID)). To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a Deep Skeleton-Pointcloud-IMU-Text Embedding model, which effectively learns a joint embedding space across these four modalities. At the heart of our empirical exploration, we have combined the existing LIPD and Babel datasets, which enabled us to synchronize data of all four modalities, allowing us to explore the learning of a new joint…
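
To make the idea of a joint embedding space more concrete, the following is a minimal sketch of contrastive alignment across several modalities, followed by cross-modal retrieval in the resulting space. It is not the authors' implementation: the symmetric InfoNCE (CLIP-style) objective, the choice to average the loss over all modality pairs, the temperature of 0.07, and the function names `symmetric_info_nce`, `joint_loss`, and `retrieve` are all assumptions made for illustration. The encoders are omitted, so the inputs are assumed to be already-computed embeddings with one row per synchronized sample.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Row i of `a` and row i of `b` are assumed to come from the same
    synchronized sample (the positive pair); all other rows in the batch
    act as negatives. The temperature value is an illustrative assumption.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def joint_loss(embeddings, temperature=0.07):
    """Average the pairwise loss over every pair of modalities.

    `embeddings` maps a modality name to an array of shape (batch, dim).
    Using all pairs is an assumption; other pairing schemes are possible.
    """
    names = sorted(embeddings)
    pairs = [(m, n) for i, m in enumerate(names) for n in names[i + 1:]]
    total = sum(
        symmetric_info_nce(embeddings[m], embeddings[n], temperature)
        for m, n in pairs
    )
    return total / len(pairs)


def retrieve(query, gallery, k=1):
    """Return indices of the k most similar gallery rows for each query row."""
    sims = l2_normalize(query) @ l2_normalize(gallery).T
    return np.argsort(-sims, axis=1)[:, :k]
```

In this sketch, calling `joint_loss` on a dictionary with the keys `"pointcloud"`, `"skeleton"`, `"imu"`, and `"text"` yields a single scalar to minimize, and `retrieve(imu_embeddings, pointcloud_embeddings)` performs IMU-to-point-cloud matching by cosine similarity. A well-aligned space should give a lower loss than unrelated embeddings and should rank each sample's own counterpart first.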

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation