Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows
Keisuke Shirai, Atsushi Hashimoto, Taichi Nishimura, Hirotaka Kameko, Shuhei Kurita, Yoshitaka Ushiku, Shinsuke Mori

TL;DR
This paper introduces Visual Recipe Flow, a multimodal dataset linking recipe text, object state changes, and workflows to facilitate learning visual state changes of objects during cooking.
Contribution
The dataset uniquely combines object state change images with recipe flow graphs, enabling cross-modal learning for cooking actions.
Findings
Dataset includes object state change image pairs and recipe flow graphs.
Grounded image pairs in recipe flows enable cross-modal reasoning.
Supports applications like multimodal commonsense reasoning and procedural text generation.
Abstract
We present a new multimodal dataset called Visual Recipe Flow, which enables us to learn the result of each cooking action in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. The state change is represented as an image pair, while the workflow is represented as a recipe flow graph (r-FG). The image pairs are grounded in the r-FG, which provides the cross-modal relation. With our dataset, one can try a range of applications, from multimodal commonsense reasoning to procedural text generation.
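To make the structure described in the abstract concrete, the following is a minimal sketch of how an image pair might be grounded in a recipe flow graph. It is an illustration under assumptions, not the dataset's actual schema: the class names, field names, node kinds, edge label, and file paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ImagePair:
    """An object's state before and after a cooking action (paths are hypothetical)."""
    before: str
    after: str


@dataclass
class Node:
    """One expression in the recipe text, e.g. an ingredient or an action."""
    node_id: int
    text: str
    kind: str  # hypothetical kinds: "food" or "action"
    image_pair: Optional[ImagePair] = None  # set only when the node is visually grounded


@dataclass
class RecipeFlowGraph:
    """A directed graph over recipe expressions with labeled edges."""
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: int, dst: int, label: str) -> None:
        # Refuse edges that point at nodes the graph does not contain.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst, label))

    def grounded_nodes(self) -> List[Node]:
        """Return nodes that carry an image pair, ordered by node id."""
        grounded = [n for n in self.nodes.values() if n.image_pair is not None]
        return sorted(grounded, key=lambda n: n.node_id)


def build_example() -> RecipeFlowGraph:
    """Build a toy graph for the instruction 'chop the onion'."""
    graph = RecipeFlowGraph()
    graph.add_node(Node(0, "onion", "food"))
    graph.add_node(
        Node(1, "chop", "action", ImagePair("onion_whole.jpg", "onion_chopped.jpg"))
    )
    # The edge label "target" is an assumed name for "the action applies to this food".
    graph.add_edge(0, 1, "target")
    return graph
```

In this sketch the cross-modal relation is simply the optional `image_pair` attached to a graph node, so the text workflow and the visual state change can be looked up from the same node.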
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization
