EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations

Ahmad Darkhalil; Dandan Shan; Bin Zhu; Jian Ma; Amlan Kar; Richard Higgins; Sanja Fidler; David Fouhey; Dima Damen

arXiv:2209.13064 · cs.CV · September 28, 2022 · 24 cites

TL;DR

VISOR is an egocentric video dataset with pixel-level annotations and a benchmark suite for segmenting hands and active objects. It targets challenges such as transformative object interactions and long-term annotation consistency.

Contribution

The paper introduces VISOR, a large-scale dataset with a novel annotation pipeline and benchmarks for egocentric video segmentation and interaction understanding.

Findings

01. 272K manual semantic masks released

02. New challenges for long-term object interaction understanding

03. Benchmark results for video segmentation and interaction tasks

Abstract

We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked - where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the…
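To put the release figures quoted in the abstract in proportion, the following is a minimal arithmetic sketch. It uses only the rounded counts stated above (272K, 9.9M, 36 hours, 179 videos), so the derived averages are approximate; the constant and function names are illustrative and do not come from the paper or its code.

```python
# Headline figures as stated in the abstract. They are rounded there,
# so every derived value below is an approximation.
MANUAL_MASKS = 272_000     # "272K manual semantic masks"
DENSE_MASKS = 9_900_000    # "9.9M interpolated dense masks"
HOURS = 36                 # "36 hours"
VIDEOS = 179               # "179 untrimmed videos"


def summarize() -> dict:
    """Derive simple per-video and per-hour averages (illustrative helper)."""
    return {
        "manual_masks_per_video": MANUAL_MASKS / VIDEOS,
        "manual_masks_per_hour": MANUAL_MASKS / HOURS,
        "dense_masks_per_manual_mask": DENSE_MASKS / MANUAL_MASKS,
        "avg_video_minutes": HOURS * 60 / VIDEOS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.1f}")
```

Under these rounded inputs, the release contains roughly 36 interpolated masks for every manual mask, and the average video is about 12 minutes long. These are averages only; the abstract does not state how masks are distributed across videos.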

Code & Models

Videos

EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations · slideslive

Taxonomy

Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Human Pose and Action Recognition