TL;DR
This paper introduces a method for aligning recipe instructions across text and video sources, and uses it to build a large multimodal dataset that captures commonsense information about how procedural tasks are structured.
Contribution
The paper presents an unsupervised alignment algorithm and a graph-based approach for aligning multiple text and video recipes for the same dish, and releases a large annotated dataset.
Findings
Successfully aligned 150K recipe instructions across modalities
Created a dataset with rich commonsense information for 4,262 dishes
Demonstrated the effectiveness of the alignment method
Abstract
Many high-level procedural tasks can be decomposed into sequences of instructions that vary in their order and choice of tools. In the cooking domain, the web offers many partially-overlapping text and video recipes (i.e. procedures) that describe how to make the same dish (i.e. high-level task). Aligning instructions for the same dish across different sources can yield descriptive visual explanations that are far richer semantically than conventional textual instructions, providing commonsense insight into how real-world procedures are structured. Learning to align these different instruction sets is challenging because: a) different recipes vary in their order of instructions and use of ingredients; and b) video instructions can be noisy and tend to contain far more information than text instructions. To address these challenges, we first use an unsupervised alignment algorithm that…
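To make the alignment task concrete, here is a minimal sketch of what matching instructions across two sources can look like. It is not the paper's algorithm (the abstract above is truncated before the method is described). It assumes a simple word-overlap (Jaccard) score, and the names `tokens`, `jaccard`, `align`, the `threshold` value, and the toy recipes are all hypothetical, chosen only for illustration. It shows the two challenges the abstract names: steps that appear in a different order, and video narration that contains lines with no counterpart in the text recipe.

```python
import re
from typing import Optional


def tokens(step: str) -> set[str]:
    """Return the set of lowercase words in one instruction."""
    return set(re.findall(r"[a-z]+", step.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(source: list[str], target: list[str],
          threshold: float = 0.2) -> list[Optional[int]]:
    """For each source step, return the index of the most similar
    target step, or None if no target step reaches the threshold.
    Matching ignores step order, so reordered steps can still align."""
    target_tokens = [tokens(t) for t in target]
    result: list[Optional[int]] = []
    for step in source:
        s = tokens(step)
        scores = [jaccard(s, t) for t in target_tokens]
        if not scores:
            result.append(None)
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        result.append(best if scores[best] >= threshold else None)
    return result


# Toy data (illustrative only, not from the released dataset).
text_recipe = [
    "Chop the onions",
    "Fry the onions in butter",
    "Add the rice and stir",
]
video_transcript = [
    "Hey everyone welcome back to the channel",  # noise: no matching step
    "Melt butter and fry the onions",            # out of order vs. text
    "First chop the onions finely",
    "Now add the rice and stir it well",
]

alignment = align(text_recipe, video_transcript)
print(alignment)  # [2, 1, 3]
```

In this toy case the greeting line in the transcript is left unmatched and the swapped chop/fry steps are still paired correctly. A word-overlap score like this breaks down quickly on real narration, which is one reason a learned, unsupervised alignment is needed.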
