SHERLock: Self-Supervised Hierarchical Event Representation Learning

Sumegh Roychowdhury; Sumedh A. Sontakke; Nikaash Puri; Mausoom Sarkar; Milan Aggarwal; Pinkesh Badjatiya; Balaji Krishnamurthy; Laurent Itti

arXiv:2010.02556·cs.LG·August 24, 2022

TL;DR

SHERLock is a self-supervised hierarchical model that learns temporal event representations from visual demonstrations and their textual descriptions. Its learned events align closely with human-annotated events, and its results are comparable to heavily supervised baselines in complex visual domains.

Contribution

The paper presents a self-supervised hierarchical model that learns temporal event representations from long-horizon visual demonstration data and associated textual descriptions, without explicit temporal supervision.

Findings

01 Aligns more closely with ground-truth human-annotated events than state-of-the-art unsupervised baselines (+15.3 alignment score)

02 Comparable to heavily supervised baselines in complex visual domains (Chess Openings, YouCook2, TutorialVQA)

03 Demonstrates robustness through ablation studies

Abstract

Temporal event representations are an essential aspect of learning among humans. They allow for succinct encoding of the experiences we have through a variety of sensory inputs. Also, they are believed to be arranged hierarchically, allowing for an efficient representation of complex long-horizon experiences. Additionally, these representations are acquired in a self-supervised manner. Analogously, here we propose a model that learns temporal representations from long-horizon visual demonstration data and associated textual descriptions, without explicit temporal supervision. Our method produces a hierarchy of representations that align more closely with ground-truth human-annotated events (+15.3) than state-of-the-art unsupervised baselines. Our results are comparable to heavily-supervised baselines in complex visual domains such as Chess Openings, YouCook2 and TutorialVQA datasets.…

Code & Models

Repositories

UNHCLE/UNHCLE (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization