COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

Koki Maeda; Tosho Hirasawa; Atsushi Hashimoto; Jun Harashima; Leszek Rybicki; Yusuke Fukasawa; Yoshitaka Ushiku

arXiv:2408.02272 · cs.CV · August 6, 2024

TL;DR

COM Kitchens introduces a new dataset of unedited overhead-view videos of food preparation, captured with smartphones, enabling research on video-to-text retrieval and dense captioning in diverse, real-world settings.

Contribution

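The paper presents a new overhead-view video dataset and two new tasks on it, video-to-text retrieval and dense captioning. It addresses two limitations of existing resources: the reliance on web videos for training and the limited environmental diversity of fixed-viewpoint datasets.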

Findings

01 Current methods built on web videos show limitations on the new tasks.

02 The dataset captures diverse real-world cooking activities.

03 The proposed tasks highlight open challenges in video captioning and retrieval.
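
To make the retrieval task concrete, here is a minimal sketch of Recall@K, a measure commonly used to score video-to-text retrieval. This page does not state which metrics the paper reports, so treat the metric choice as an assumption. The function name `recall_at_k`, the similarity-matrix layout, and the toy scores are all hypothetical and not taken from the paper or its repository.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of video queries whose matching text ranks in the top k.

    Assumption: similarity[i][j] scores video i against text j, and the
    correct text for video i sits at index i (a hypothetical pairing).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate texts from most to least similar to video i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three videos and three recipe texts (made-up numbers).
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

print(recall_at_k(toy_scores, k=1))
print(recall_at_k(toy_scores, k=2))
```

In this toy example the second video ranks its matching text second, so it counts as a hit at K=2 but not at K=1.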

Abstract

Procedural video understanding is gaining attention in the vision and language community. Deep learning-based video analysis requires extensive data. Consequently, existing works often use web videos as training resources, making it challenging to query instructional contents from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants performed food preparation based on given recipes. Fixed-viewpoint video datasets often lack environmental diversity due to high camera setup costs. We used modern wide-angle smartphone lenses to cover cooking counters from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. With this dataset, we…

Code & Models

Repositories

omron-sinicx/com_kitchens (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Softmax · Attention Is All You Need