Multi-modal Cooking Workflow Construction for Food Recipes

Liangming Pan; Jingjing Chen; Jianlong Wu; Shaoteng Liu; Chong-Wah Ngo; Min-Yen Kan; Yu-Gang Jiang; Tat-Seng Chua

arXiv:2008.09151·cs.CL·August 24, 2020

TL;DR

This paper introduces MM-ReS, a large-scale multi-modal dataset of food recipes with workflow graphs, and proposes a neural model that effectively combines images and text to construct cooking workflows, outperforming previous methods.

Contribution

The paper presents the first large-scale multi-modal dataset for cooking workflow construction and a neural model that leverages both images and text to improve workflow prediction.

Findings

01 The proposed model improved performance by more than 20% over baseline methods.

02 Created MM-ReS, the first large-scale dataset for cooking workflow construction, with 9,850 recipes and human-labeled workflow graphs.

03 Demonstrated the effectiveness of multi-modal information in cooking workflow understanding.

Abstract

Understanding a food recipe requires anticipating the implicit causal effects of cooking actions, so that the recipe can be converted into a graph describing its temporal workflow. This is a non-trivial task that involves common-sense reasoning. However, existing efforts rely on hand-crafted features to extract the workflow graph from recipes because large-scale labeled datasets are lacking. Moreover, they fail to utilize the cooking images, which constitute an important part of food recipes. In this paper, we build MM-ReS, the first large-scale dataset for cooking workflow construction, consisting of 9,850 recipes with human-labeled workflow graphs. Cooking steps are multi-modal, featuring both text instructions and cooking images. We then propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow, which…
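To make the idea of a workflow graph concrete, the sketch below shows one possible way to represent a recipe as a directed graph over its steps. It is an illustration only, not the MM-ReS format or the paper's model: the `Step` class, the `valid_order` helper, the example recipe, and the image file names are all hypothetical. An edge `(a, b)` is assumed to mean that step `b` depends on the outcome of step `a`.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Optional


@dataclass
class Step:
    """One multi-modal cooking step (hypothetical structure)."""
    text: str                    # the text instruction
    image: Optional[str] = None  # path to the step's image, if any


def valid_order(num_steps, edges):
    """Return one ordering of step indices that respects every dependency edge."""
    # Map each step to the set of steps that must be finished before it.
    preds = {i: set() for i in range(num_steps)}
    for src, dst in edges:
        preds[dst].add(src)
    # Raises graphlib.CycleError if the edges do not form a DAG.
    return list(TopologicalSorter(preds).static_order())


# Made-up recipe: steps 0 and 1 are independent, so the workflow
# is a graph rather than a simple chain.
steps = [
    Step("Boil the pasta.", "step0.jpg"),
    Step("Chop the tomatoes.", "step1.jpg"),
    Step("Simmer the tomatoes into a sauce.", "step2.jpg"),
    Step("Toss the pasta with the sauce.", "step3.jpg"),
]
edges = {(0, 3), (1, 2), (2, 3)}

order = valid_order(len(steps), edges)
for i in order:
    print(i, steps[i].text)
```

Because boiling the pasta and chopping the tomatoes do not depend on each other, more than one ordering is valid here. The point of the graph is that it records which steps feed into which, which a linear list of instructions leaves implicit.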
