EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models

Andy Bonnetto; Haozhe Qi; Franklin Leong; Matea Tashkovska; Mahdi Rad; Solaiman Shokur; Friedhelm Hummel; Silvestro Micera; Marc Pollefeys; Alexander Mathis

arXiv:2506.01608 · cs.CV · August 26, 2025

TL;DR

The EPFL-Smart-Kitchen-30 dataset is a multi-modal collection of human actions recorded in a kitchen environment. It supports research in behavior understanding and comes with benchmarks for vision-language modeling, motion generation, and action recognition.

Contribution

This work introduces a densely annotated, multi-modal kitchen activity dataset with benchmarks for behavior modeling, advancing research in video and language understanding of human actions.

Findings

01

Dataset includes 29.7 hours of multi-view recordings of cooking activities.

02

Four benchmarks established for behavior understanding and modeling.

03

Code and data are publicly available for research use.

Abstract

Understanding behavior requires datasets that capture humans while carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens 2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal…

Code & Models

Repositories

amathislab/epfl-smart-kitchen (Official)
