Depth-Guided Self-Supervised Human Keypoint Detection via Cross-Modal Distillation
Aman Anand, Elyas Rashno, Amir Eskandari, Farhana Zulkernine

TL;DR
This paper introduces Distill-DKP, a self-supervised framework that uses depth maps to improve human keypoint detection, significantly outperforming previous methods by leveraging cross-modal knowledge distillation.
Contribution
The paper presents a novel cross-modal knowledge distillation approach that incorporates depth information to enhance unsupervised human keypoint detection.
Findings
Reduces mean L2 error by 47.15% on Human3.6M
Improves keypoint accuracy by 1.3% on DeepFashion
Demonstrates effective knowledge transfer across network layers
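As a rough illustration of how a figure such as "reduces mean L2 error by 47.15%" can be read, the sketch below computes a mean L2 (Euclidean) keypoint error and the relative reduction between two models. The function names and the toy coordinates are hypothetical and are not taken from the paper. The paper's exact evaluation protocol (for example, any normalization or keypoint regression step) is not described in this summary, so this is only an assumed, simplified version of the metric.

```python
import math


def mean_l2_error(pred, target):
    """Average Euclidean distance between matching (x, y) keypoints."""
    dists = [math.dist(p, t) for p, t in zip(pred, target)]
    return sum(dists) / len(dists)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical toy keypoints, not results from the paper.
target = [(0.0, 0.0), (1.0, 1.0)]
baseline_pred = [(3.0, 4.0), (1.0, 1.0)]  # distances 5 and 0, mean 2.5
improved_pred = [(0.0, 2.0), (1.0, 1.0)]  # distances 2 and 0, mean 1.0

baseline_err = mean_l2_error(baseline_pred, target)
improved_err = mean_l2_error(improved_pred, target)
print(relative_reduction(baseline_err, improved_err))  # 60.0
```

Under this reading, a 47.15% reduction means the new error is roughly 0.53 times the baseline error, not that the error dropped by 47.15 absolute units.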
Abstract
Existing unsupervised keypoint detection methods apply artificial deformations to images, such as masking a significant portion of the image, and use reconstruction of the original image as the learning objective for detecting keypoints. However, this approach lacks depth information and often detects keypoints on the background. To address this, we propose Distill-DKP, a novel cross-modal knowledge distillation framework that leverages depth maps and RGB images for keypoint detection in a self-supervised setting. During training, Distill-DKP extracts embedding-level knowledge from a depth-based teacher model to guide an image-based student model, with inference restricted to the student. Experiments show that Distill-DKP significantly outperforms previous unsupervised methods by reducing mean L2 error by 47.15% on Human3.6M, mean average error by 5.67% on Taichi, and improving keypoints…
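The embedding-level distillation described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the use of cosine distance, the one-to-one pairing of student and teacher layers, the `weight` parameter, and all function names are assumptions made for this example. Embeddings are plain Python lists standing in for flattened feature maps.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two flattened embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)  # eps guards against zero vectors


def distillation_loss(student_embeddings, teacher_embeddings):
    """Mean cosine distance over paired layers (assumed loss form).

    The teacher (depth) embeddings are treated as fixed targets; only the
    student (RGB) embeddings would receive gradients in a real framework.
    """
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_embeddings, teacher_embeddings)
    ]
    return sum(distances) / len(distances)


def total_loss(reconstruction_loss, distill_loss, weight=1.0):
    """Combine the student's own objective with the distillation term.

    `weight` is a hypothetical balancing coefficient.
    """
    return reconstruction_loss + weight * distill_loss


# Toy embeddings for two layers: the first pair is aligned, the second orthogonal.
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[1.0, 0.0], [2.0, 0.0]]

d_loss = distillation_loss(student, teacher)
print(d_loss)  # 0.5
```

At inference time only the student would be run, so the depth teacher and the distillation term add no cost once training is finished, which matches the abstract's note that inference is restricted to the student.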
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: Knowledge Distillation
