HD-EPIC: A Highly-Detailed Egocentric Video Dataset

Toby Perrett; Ahmad Darkhalil; Saptarshi Sinha; Omar Emara; Sam Pollard; Kranti Parida; Kaiting Liu; Prajwal Gatti; Siddhant Bansal; Kevin Flanagan; Jacob Chalk; Zhifan Zhu; Rhodri Guerrier; Fahd Abdelazim; Bin Zhu; Davide Moltisanti; Michael Wray; Hazel Doughty; Dima Damen

arXiv:2502.04144 · cs.CV · March 26, 2025

TL;DR

HD-EPIC is an egocentric video dataset of unscripted home-kitchen recordings with detailed annotations grounded in 3D, supporting research in recipe understanding, action recognition, and scene perception.

Contribution

The paper introduces HD-EPIC, an in-the-wild egocentric dataset whose annotations cover recipe steps, fine-grained actions, ingredients, moving objects, and audio, all grounded in 3D, supporting diverse vision tasks.

Findings

01 The detailed annotations support a challenging VQA benchmark of 26K questions.

02 Current VLMs struggle with detailed egocentric data.

03 HD-EPIC supports multiple vision tasks, such as action and sound recognition.

Abstract

We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful…

Taxonomy

Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Video Coding and Compression Technologies