EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations
Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, Dima Damen

TL;DR
VISOR is a comprehensive egocentric video dataset with detailed pixel annotations and benchmarks for segmenting hands and objects, addressing challenges like object interactions and long-term consistency.
Contribution
The paper introduces VISOR, a large-scale dataset with a novel, partly AI-powered annotation pipeline and benchmarks for egocentric video segmentation and hand-object interaction understanding.
Findings
272K manual semantic masks of 257 object classes released
New challenges for long-term object interaction understanding
Benchmark results for video segmentation and interaction tasks
Abstract
We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked - where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the…
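To give a feel for the scale of the release, the figures quoted in the abstract can be turned into rough per-video and per-hour averages. This is only a back-of-the-envelope sketch: the constant names and the `per_unit` helper are illustrative and not part of any VISOR tooling, and because the abstract reports rounded totals ("272K", "9.9M"), the results are approximate.

```python
# Release statistics as quoted (rounded) in the abstract.
MANUAL_MASKS = 272_000          # manual semantic masks
INTERPOLATED_MASKS = 9_900_000  # interpolated dense masks
VIDEOS = 179                    # untrimmed videos
HOURS = 36                      # total footage covered


def per_unit(total: float, units: float) -> float:
    """Average of `total` spread evenly over `units` (hypothetical helper)."""
    return total / units


# Average number of manual masks per video and per hour of footage.
masks_per_video = per_unit(MANUAL_MASKS, VIDEOS)
masks_per_hour = per_unit(MANUAL_MASKS, HOURS)

# How many interpolated masks exist for each manual mask.
dense_ratio = per_unit(INTERPOLATED_MASKS, MANUAL_MASKS)

# Average video length in minutes.
avg_video_minutes = per_unit(HOURS * 60, VIDEOS)

if __name__ == "__main__":
    print(f"manual masks per video: {masks_per_video:.0f}")
    print(f"manual masks per hour:  {masks_per_hour:.0f}")
    print(f"interpolated per manual mask: {dense_ratio:.1f}")
    print(f"average video length (min):   {avg_video_minutes:.1f}")
```

Under these rounded inputs, the release averages roughly 1,500 manual masks per video, with about 36 interpolated masks for every manual one, and the videos average about 12 minutes each.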
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Human Pose and Action Recognition
