TL;DR
This paper introduces CoSeg, a self-supervised, transformer-based framework for generic event segmentation inspired by human cognition. It outperforms previous methods at event boundary detection on four datasets.
Contribution
The work presents a novel, cognitively inspired, self-supervised event segmentation method using transformer-based feature reconstruction and temporal contrastive learning.
Findings
Outperforms previous methods on four datasets
Achieves high F1 scores for event boundary detection
Effective in segmenting generic events
Abstract
Cognitive research has found that humans accomplish event segmentation as a side effect of event anticipation. Inspired by this finding, we propose a simple yet effective end-to-end self-supervised learning framework for event segmentation/boundary detection. Unlike the mainstream clustering-based methods, our framework uses a transformer-based feature reconstruction scheme to detect event boundaries from reconstruction errors. This is consistent with the fact that humans spot new events by leveraging the deviation between their prediction and what is actually perceived. Because frames at boundaries are semantically heterogeneous, they are difficult to reconstruct (generally yielding large reconstruction errors), which is favorable for event boundary detection. Additionally, since the reconstruction occurs on the semantic feature level instead of pixel level, we develop…
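To make the reconstruction-error idea concrete, here is a minimal sketch of how boundaries might be read off a per-frame error signal. It assumes the errors have already been computed (in the paper, by a transformer reconstructing semantic features). The peak rule, the threshold of mean plus k standard deviations, and the names detect_boundaries, k, and min_gap are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean, pstdev


def detect_boundaries(errors, k=1.0, min_gap=2):
    """Return frame indices that look like event boundaries.

    Assumed rule (not the paper's): a frame is a boundary when its
    reconstruction error is a local peak above mean + k * std.
    """
    if len(errors) < 3:
        return []
    threshold = mean(errors) + k * pstdev(errors)
    boundaries = []
    for t in range(1, len(errors) - 1):
        is_peak = errors[t] > errors[t - 1] and errors[t] >= errors[t + 1]
        if not (is_peak and errors[t] > threshold):
            continue
        if boundaries and t - boundaries[-1] < min_gap:
            # Two peaks closer than min_gap: keep only the larger one.
            if errors[t] > errors[boundaries[-1]]:
                boundaries[-1] = t
        else:
            boundaries.append(t)
    return boundaries


# Hypothetical per-frame reconstruction errors with two clear spikes.
errors = [0.1, 0.12, 0.11, 0.9, 0.15, 0.1, 0.13, 0.85, 0.2, 0.1]
print(detect_boundaries(errors))  # [3, 7]
```

Frames 3 and 7 are flagged because their errors are local peaks well above the rest of the signal, which mirrors the intuition that poorly predicted frames mark the start of a new event.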
