Towards Tokenized Human Dynamics Representation

Kenneth Li; Xiao Sun; Zhirong Wu; Fangyun Wei; Stephen Lin

arXiv:2111.11433 · cs.CV · November 23, 2021 · 1 citation

TL;DR

This paper introduces a self-supervised framework for segmenting and clustering long human motion videos into recurring patterns, enabling effective video tokenization and improving downstream tasks like genre classification and action segmentation.

Contribution

It presents a novel two-stage self-supervised approach for video tokenization by learning frame representations and clustering them into actons, addressing annotation scarcity in long human dynamics.
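The second stage, turning per-frame representations into a sequence of discrete tokens, can be illustrated with a small sketch. This is an assumption-laden toy, not the authors' implementation: plain k-means stands in for whatever clustering the paper uses, the helper names `kmeans` and `tokenize` are hypothetical, and the 2-D "embeddings" are made up for illustration.

```python
import random
from itertools import groupby


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Toy k-means (an assumed stand-in for the paper's clustering step).

    Returns one cluster label per input point.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each frame goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def tokenize(labels):
    """Collapse consecutive identical labels into (label, run_length) segments."""
    return [(label, len(list(run))) for label, run in groupby(labels)]


# Hypothetical frame embeddings: pattern A (3 frames), pattern B (3), pattern A (2).
frames = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (0.1, 0.1), (0.05, 0.05),
]
segments = tokenize(kmeans(frames, k=2))
print(segments)
```

Each resulting segment plays the role of one token: a cluster identity plus the number of frames it spans. The recurring pattern at the start and end of the toy sequence receives the same cluster label, which is the property that makes the tokens reusable across a long video.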

Findings

01. Significant performance improvements on AIST++ and PKU-MMD datasets.

02. Effective frame-wise representation learning evaluated by Kendall's Tau.

03. Successful application to genre classification, action segmentation, and action composition.
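For readers unfamiliar with the metric in the second finding, Kendall's Tau measures how well two orderings agree. The sketch below computes the basic tau-a variant in plain Python. How the paper applies it to frame representations is not stated on this page, so the usage shown (comparing a true frame order with an order recovered from embeddings) is an assumption, and `kendall_tau` is a hypothetical helper name.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Assumed usage: true frame indices versus the order recovered from embeddings.
true_order = [0, 1, 2, 3]
recovered_order = [0, 2, 1, 3]  # one adjacent pair swapped
print(kendall_tau(true_order, recovered_order))
```

A value close to 1 indicates that the learned representation preserves temporal order; a value near 0 indicates that the recovered order is essentially unrelated to the true one.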

Abstract

For human action understanding, a popular research direction is to analyze short video clips with unambiguous semantic content, such as jumping and drinking. However, methods for understanding short semantic actions cannot be directly translated to long human dynamics such as dancing, where it becomes challenging even to label the human movements semantically. Meanwhile, the natural language processing (NLP) community has made progress in solving a similar challenge of annotation scarcity by large-scale pre-training, which improves several downstream tasks with one model. In this work, we study how to segment and cluster videos into recurring temporal patterns in a self-supervised way, namely acton discovery, the main roadblock towards video tokenization. We propose a two-stage framework that first obtains a frame-wise representation by contrasting two augmented views of video frames…
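The first stage contrasts two augmented views of the same frames. The abstract as shown here does not name the loss, so the sketch below uses InfoNCE, a common contrastive objective, purely as an assumed example. The function name `info_nce`, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length; zero vectors are rejected."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE loss (an assumed choice) over paired frame embeddings.

    Row i of view_a and row i of view_b are two augmentations of frame i
    (the positive pair); every other row of view_b acts as a negative.
    """
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings: matched views should give a lower loss than mismatched ones.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
mismatched = [[0.1, 0.9], [0.9, 0.1]]
print(info_nce(anchors, matched), info_nce(anchors, mismatched))
```

Minimizing a loss of this kind pulls the two views of each frame together and pushes different frames apart, which is one way to obtain the frame-wise representation that the clustering stage then operates on.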

Code & Models

Repositories

likenneth/acton (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications