BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding

Patrick Knab; Orgest Xhelili; Inis Buzi; Drago Andres Guggiana Nilo; Mohd Saquib Khan; Lorenz Kolb; Manuel Scherzer; Kerem Yildirir; Christian Bartelt; Philipp Johannes Schubert

arXiv:2605.12074 · cs.CV · May 13, 2026

TL;DR

BARISTA is a densely annotated egocentric benchmark dataset of 185 real-world coffee-preparation videos for procedural video understanding, enabling evaluation of multiple interconnected visual reasoning tasks.

Contribution

The paper introduces a densely annotated dataset and benchmark of egocentric procedural videos that supports multi-task evaluation of physical scene understanding.

Findings

01 Models show high variability across tasks.

02 No single model family dominates performance.

03 BARISTA is a challenging diagnostic tool for procedural understanding.

Abstract

Scene understanding is central to general physical intelligence, and video is a primary modality for capturing both state and temporal dynamics of a scene. Yet understanding physical processes remains difficult, as models must combine object localization, hand-object interactions, relational parsing, temporal reasoning, and step-level procedural inference. Existing benchmarks usually evaluate these capabilities separately, limiting diagnosis of why models fail on procedural tasks. We introduce BARISTA, a densely annotated egocentric dataset and benchmark of 185 real-world coffee-preparation videos covering fully automatic, portafilter-based, and capsule-based workflows. BARISTA provides verified per-frame scene graphs linking persistent object identities to masks, tracks, boxes, attributes, typed relations, hand-object interactions, activities, and process steps. From these graphs, we…
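
The per-frame scene graph described in the abstract can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: every class name, field name, and example value below is hypothetical and is not taken from the BARISTA release, whose actual schema may differ. Masks are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# All names below are hypothetical; they illustrate the idea of a per-frame
# scene graph with persistent object identities, not the dataset's real schema.

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class ObjectNode:
    object_id: int                 # identity that persists across frames
    category: str
    box: Box
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    subject_id: int
    predicate: str                 # typed relation, e.g. "on" or "holding"
    object_id: int


@dataclass
class FrameGraph:
    frame_index: int
    objects: Dict[int, ObjectNode]
    relations: List[Relation]
    activity: str
    process_step: str


def track_of(frames: List[FrameGraph], object_id: int) -> List[Tuple[int, Box]]:
    """Collect (frame_index, box) pairs for one persistent object identity."""
    return [
        (f.frame_index, f.objects[object_id].box)
        for f in frames
        if object_id in f.objects
    ]


def held_objects(frame: FrameGraph, hand_id: int) -> List[int]:
    """Return ids of objects the given hand is holding in this frame."""
    return [
        r.object_id
        for r in frame.relations
        if r.subject_id == hand_id and r.predicate == "holding"
    ]


# Two made-up frames: a hand (id 1) reaches for a cup (id 2), then holds it.
frames = [
    FrameGraph(
        frame_index=0,
        objects={
            1: ObjectNode(1, "hand", (10.0, 20.0, 60.0, 80.0)),
            2: ObjectNode(2, "cup", (100.0, 50.0, 140.0, 110.0), {"state": "empty"}),
        },
        relations=[],
        activity="reach",
        process_step="place cup",
    ),
    FrameGraph(
        frame_index=1,
        objects={
            1: ObjectNode(1, "hand", (90.0, 40.0, 140.0, 100.0)),
            2: ObjectNode(2, "cup", (100.0, 50.0, 140.0, 110.0), {"state": "empty"}),
        },
        relations=[Relation(1, "holding", 2)],
        activity="grasp",
        process_step="place cup",
    ),
]

cup_track = track_of(frames, 2)
held_in_second_frame = held_objects(frames[1], 1)
```

Keying objects by a persistent id is what lets a track, a hand-object interaction, and a process step all refer to the same entity, which is the kind of linkage the abstract attributes to the verified scene graphs.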

Code & Models

Repositories

https://huggingface.co/datasets/ramblr/BARISTA

Datasets

ramblr/BARISTA
dataset · 539 downloads
