Temporal Representation Learning on Monocular Videos for 3D Human Pose Estimation
Sina Honari, Victor Constantin, Helge Rhodin, Mathieu Salzmann, Pascal Fua

TL;DR
This paper introduces an unsupervised method for learning temporal features from monocular videos. It improves 3D human pose estimation by disentangling time-variant from time-invariant features and applying contrastive learning to the time-variant ones.
Contribution
It proposes a contrastive self-supervised approach that explicitly disentangles each latent vector into a time-variant and a time-invariant component and applies the contrastive loss only to the time-variant part, which reduces 3D pose estimation error.
Findings
Reduces pose estimation error by about 50% compared to standard CSS methods.
Outperforms other unsupervised single-view approaches.
Matches the performance of multi-view techniques.
Abstract
In this paper we propose an unsupervised feature extraction method to capture temporal information in monocular videos, where we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally distant ones as negative pairs, as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying the contrastive loss only to the time-variant features, encouraging a gradual transition in them between nearby and distant frames, and also reconstructing the input extracts rich temporal features that are well suited to human pose estimation. Our approach reduces error by about 50% compared to the standard CSS strategies, outperforms other…
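As a rough illustration of the idea in the abstract, the sketch below splits a latent vector into a time-variant and a time-invariant part and computes a contrastive term on the time-variant part only. It is not the authors' implementation: the function names, the `n_variant` split point, the `margin` value, and the use of a hinge-style triplet loss are all assumptions made for illustration, and the gradual-transition and reconstruction terms mentioned in the abstract are omitted.

```python
def split_latent(z, n_variant):
    """Split a latent vector into (time-variant, time-invariant) parts.

    Assumption: the first n_variant entries are the time-variant component.
    """
    return z[:n_variant], z[n_variant:]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def time_variant_triplet_loss(anchor, positive, negative, n_variant, margin=1.0):
    """Hinge triplet loss computed on the time-variant components only.

    anchor/positive: latents of temporally nearby frames.
    negative: latent of a temporally distant frame.
    The time-invariant components are ignored by this term.
    """
    a_tv, _ = split_latent(anchor, n_variant)
    p_tv, _ = split_latent(positive, n_variant)
    n_tv, _ = split_latent(negative, n_variant)
    return max(0.0, sq_dist(a_tv, p_tv) - sq_dist(a_tv, n_tv) + margin)


# Hypothetical 3-D latents: two time-variant entries, one time-invariant entry.
anchor = [0.0, 0.0, 5.0]
nearby = [0.1, 0.0, -3.0]
distant = [0.5, 0.0, 9.0]
loss = time_variant_triplet_loss(anchor, nearby, distant, n_variant=2)
```

Because only the first `n_variant` entries enter the loss, changing the last entry of any of the three latents leaves `loss` unchanged, which is the sense in which the time-invariant component is left free of the contrastive term in this sketch.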
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Vision and Imaging
