TL;DR
This paper introduces an unsupervised, temporally-weighted hierarchical clustering method for action segmentation in videos, achieving significant improvements over existing methods without requiring labeled data.
Contribution
The paper presents a novel unsupervised clustering algorithm that effectively segments actions in videos by incorporating temporal information, eliminating the need for training data.
Findings
Achieves state-of-the-art performance on five datasets
Demonstrates the effectiveness of temporal weighting in clustering
Provides strong unsupervised baselines for action segmentation
Abstract
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos and is an important requirement for many video understanding tasks. For this and other video understanding tasks, supervised approaches have achieved encouraging performance but require a high volume of detailed frame-level annotations. We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video. Our main finding is that representing a video with a 1-nearest neighbor graph by taking into account the time progression is sufficient to form semantically and temporally consistent clusters of frames where each cluster may represent some action in the video. Additionally, we establish…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsFirst Integer Neighbor Clustering Hierarchy (FINCH))
