Unsupervised Video Understanding by Reconciliation of Posture Similarities
Timo Milbich, Miguel Bautista, Ekaterina Sutter, Björn Ommer

TL;DR
This paper introduces an unsupervised deep learning method for understanding human activities in videos by learning posture representations without manual annotations, enabling retrieval, super-resolution, and frame synthesis.
Contribution
It presents a novel unsupervised approach that combines sequence matching and CNNs to learn structured posture embeddings from raw video data.
Findings
Learns posture representations without supervision
Enables posture retrieval and temporal super-resolution
Allows frame synthesis based on learned embeddings
Abstract
Understanding human activity and being able to explain it in detail surpasses mere action classification by far in both complexity and value. The challenge is thus to describe an activity on the basis of its most fundamental constituents, the individual postures and their distinctive transitions. Supervised learning of such a fine-grained representation based on elementary poses is very tedious and does not scale. Therefore, we propose a completely unsupervised deep learning procedure based solely on video sequences, which starts from scratch without requiring pre-trained networks, predefined body models, or keypoints. A combinatorial sequence matching algorithm proposes relations between frames from subsets of the training data, while a CNN is reconciling the transitivity conflicts of the different subsets to learn a single concerted pose embedding despite changes in appearance across…
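To make the idea of matching frames across sequences concrete, here is a minimal sketch of how frame correspondences between two videos could be proposed. The abstract does not specify the combinatorial matching algorithm, so classic dynamic time warping is used as a stand-in. The function name `align_sequences` and the representation of each frame as a small tuple of numbers are assumptions made purely for illustration, not the authors' method.

```python
from math import dist, inf


def align_sequences(seq_a, seq_b):
    """Align two sequences of per-frame feature vectors with dynamic time warping.

    Illustrative stand-in only: the paper's own matching algorithm is not
    described in the abstract. Returns (total_cost, path), where path is a
    list of (index_in_a, index_in_b) frame correspondences.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover the matched frame pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy 1-D "posture features": the second clip holds the first posture one frame longer.
clip_a = [(0.0,), (1.0,), (2.0,)]
clip_b = [(0.0,), (0.0,), (1.0,), (2.0,)]
total, pairs = align_sequences(clip_a, clip_b)
print(total, pairs)
```

In a pipeline like the one the abstract outlines, matched pairs of this kind would act as proposed similarity relations between frames, and a CNN would then be trained to reconcile conflicting proposals from different subsets into one embedding.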
