SHERLock: Self-Supervised Hierarchical Event Representation Learning
Sumegh Roychowdhury, Sumedh A. Sontakke, Nikaash Puri, Mausoom Sarkar, Milan Aggarwal, Pinkesh Badjatiya, Balaji Krishnamurthy, Laurent Itti

TL;DR
SHERLock introduces a self-supervised hierarchical model that learns temporal event representations from visual and textual data. Its representations align closely with human-annotated events, and its results are comparable to heavily supervised baselines in complex visual domains.
Contribution
The paper presents a self-supervised hierarchical model that learns temporal event representations from long-horizon visual demonstrations and their associated textual descriptions, without explicit temporal supervision.
Findings
Aligns more closely with ground-truth human-annotated events than state-of-the-art unsupervised baselines (+15.3)
Comparable to heavily supervised baselines in complex visual domains such as Chess Openings, YouCook2 and TutorialVQA
Demonstrates robustness through ablation studies
Abstract
Temporal event representations are an essential aspect of learning among humans. They allow for succinct encoding of the experiences we have through a variety of sensory inputs. Also, they are believed to be arranged hierarchically, allowing for an efficient representation of complex long-horizon experiences. Additionally, these representations are acquired in a self-supervised manner. Analogously, here we propose a model that learns temporal representations from long-horizon visual demonstration data and associated textual descriptions, without explicit temporal supervision. Our method produces a hierarchy of representations that align more closely with ground-truth human-annotated events (+15.3) than state-of-the-art unsupervised baselines. Our results are comparable to heavily-supervised baselines in complex visual domains such as Chess Openings, YouCook2 and TutorialVQA datasets.…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
