Towards Tokenized Human Dynamics Representation
Kenneth Li, Xiao Sun, Zhirong Wu, Fangyun Wei, Stephen Lin

TL;DR
This paper introduces a self-supervised framework for segmenting and clustering long human motion videos into recurring patterns, enabling effective video tokenization and improving downstream tasks like genre classification and action segmentation.
Contribution
It presents a novel two-stage self-supervised approach for video tokenization by learning frame representations and clustering them into actons, addressing annotation scarcity in long human dynamics.
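As a rough illustration of what tokenization means here, the sketch below collapses per-frame cluster assignments into contiguous segments, one token per segment. It assumes a cluster label is already available for every frame; the function name `frames_to_tokens` and the example labels are hypothetical, and this is not the authors' implementation.

```python
from itertools import groupby


def frames_to_tokens(frame_labels):
    """Collapse per-frame cluster labels into (label, start, end) segments.

    `end` is exclusive, so a segment covers frames start..end-1.
    """
    tokens = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        tokens.append((label, start, start + length))
        start += length
    return tokens


# Hypothetical labels for a 10-frame clip (made up for illustration).
labels = [3, 3, 3, 1, 1, 4, 4, 4, 4, 1]
print(frames_to_tokens(labels))
```

Each tuple plays the role of one token: a recurring pattern id plus the frame span it occupies.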
Findings
Significant performance improvements on AIST++ and PKU-MMD datasets.
Effective frame-wise representation learning, as measured by Kendall's Tau.
Successful application to genre classification, action segmentation, and action composition.
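For readers unfamiliar with Kendall's Tau as a measure of frame-wise representations, here is a minimal sketch. It assumes a common protocol in which each frame of one clip is matched to its nearest neighbor in another clip and the temporal order of the matches is scored; the summary does not state that this is the exact protocol used in the paper. The names `kendall_tau` and `alignment_tau` are hypothetical.

```python
import numpy as np


def kendall_tau(order):
    """Kendall's Tau (tau-a) of a sequence against increasing order.

    Tied pairs count as neither concordant nor discordant.
    """
    order = np.asarray(order)
    n = len(order)
    if n < 2:
        raise ValueError("need at least two elements")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if order[i] < order[j]:
                concordant += 1
            elif order[i] > order[j]:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def alignment_tau(emb_a, emb_b):
    """Match each frame of clip A to its nearest frame in clip B, then score the order."""
    dists = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)  # index in B of the closest frame, per frame of A
    return kendall_tau(nearest)
```

A value near 1 means the matched frames preserve temporal order; a value near -1 means the order is reversed.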
Abstract
For human action understanding, a popular research direction is to analyze short video clips with unambiguous semantic content, such as jumping and drinking. However, methods for understanding short semantic actions cannot be directly translated to long human dynamics such as dancing, where it becomes challenging even to label the human movements semantically. Meanwhile, the natural language processing (NLP) community has made progress in solving a similar challenge of annotation scarcity by large-scale pre-training, which improves several downstream tasks with one model. In this work, we study how to segment and cluster videos into recurring temporal patterns in a self-supervised way, namely acton discovery, the main roadblock towards video tokenization. We propose a two-stage framework that first obtains a frame-wise representation by contrasting two augmented views of video frames…
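The idea of contrasting two augmented views of the same frames can be sketched with an InfoNCE-style loss. This is an assumption made for illustration: the text above does not specify the loss, and the function name `info_nce` and the `temperature` value are hypothetical.

```python
import numpy as np


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss: row i of view_a should match row i of view_b.

    view_a, view_b: arrays of shape (num_frames, dim) holding embeddings of
    two augmented views of the same frames.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; all other frames act as negatives.
    return -np.mean(np.diag(log_prob))
```

The loss is small when each frame's two views are closer to each other than to any other frame in the batch, and grows when the views are mismatched.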
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications
