Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows

Keisuke Shirai; Atsushi Hashimoto; Taichi Nishimura; Hirotaka Kameko; Shuhei Kurita; Yoshitaka Ushiku; Shinsuke Mori

arXiv:2209.05840 · cs.CL · September 14, 2022 · 5 cites

Open Access

TL;DR

This paper introduces Visual Recipe Flow, a multimodal dataset linking recipe text, object state changes, and workflows to facilitate learning visual state changes of objects during cooking.

Contribution

The dataset uniquely combines object state change images with recipe flow graphs, enabling cross-modal learning for cooking actions.

Findings

01. Dataset includes object state change image pairs and recipe flow graphs.

02. Grounded image pairs in recipe flows enable cross-modal reasoning.

03. Supports applications like multimodal commonsense reasoning and procedural text generation.

Abstract

We present a new multimodal dataset called Visual Recipe Flow, which enables learning the result of each cooking action in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. Each state change is represented as an image pair, while the workflow is represented as a recipe flow graph (r-FG). The image pairs are grounded in the r-FG, which provides the cross-modal relation. With our dataset, one can try a range of applications, from multimodal commonsense reasoning to procedural text generation.
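To make the structure described in the abstract concrete, the following is a minimal Python sketch of how image pairs might be grounded in a recipe flow graph. It is an illustration under stated assumptions, not the dataset's actual schema: the class names, field names, tag strings, edge labels, and file paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ImagePair:
    """Object state before and after one cooking action (hypothetical fields)."""
    before_path: str
    after_path: str


@dataclass
class FlowNode:
    """A tagged expression in the recipe text (tag names are illustrative)."""
    node_id: int
    text: str
    tag: str
    # Set only when the node is grounded in an observed state change.
    image_pair: Optional[ImagePair] = None


@dataclass
class RecipeFlowGraph:
    """Nodes are recipe expressions; edges are labeled relations between them."""
    nodes: Dict[int, FlowNode] = field(default_factory=dict)
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

    def add_node(self, node: FlowNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: int, dst: int, label: str) -> None:
        # Reject edges that point at nodes missing from the graph.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst, label))

    def grounded_actions(self) -> List[FlowNode]:
        """Return the nodes that carry an image pair, in insertion order."""
        return [n for n in self.nodes.values() if n.image_pair is not None]


def build_example() -> RecipeFlowGraph:
    """Toy graph for the text 'Chop the onion and fry it' (made-up data)."""
    g = RecipeFlowGraph()
    g.add_node(FlowNode(0, "onion", "ingredient"))
    g.add_node(FlowNode(1, "Chop", "action",
                        ImagePair("onion_whole.jpg", "onion_chopped.jpg")))
    g.add_node(FlowNode(2, "fry", "action",
                        ImagePair("onion_chopped.jpg", "onion_fried.jpg")))
    g.add_edge(0, 1, "target")  # the onion is what gets chopped
    g.add_edge(1, 2, "target")  # the chopped result is what gets fried
    return g


if __name__ == "__main__":
    for node in build_example().grounded_actions():
        print(node.text, node.image_pair.before_path, node.image_pair.after_path)
```

Attaching the image pair to the graph node is one simple way to express the cross-modal relation: traversing the edges recovers the workflow, and each grounded node gives the visual result of that step.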

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization