What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision

Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nick Johnston, Andrew Rabinovich, and Kevin Murphy

arXiv:1503.01558 · cs.CL · March 16, 2015

TL;DR

This paper introduces a method for aligning cooking recipe instructions with video content by combining speech transcripts and visual food detection, enabling applications like automatic recipe illustration and event search.

Contribution

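The paper aligns recipe steps to the automatically generated speech transcript with an HMM, then refines that alignment with a visual food detector based on a deep convolutional neural network. The combined approach outperforms simpler keyword-spotting techniques on cooking videos.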

Findings

01 The alignment method outperforms simpler techniques based on keyword spotting.

02 It enables automatic illustration of recipes with keyframes.

03 It supports searching within a video for events of interest.

Abstract

We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
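The first stage described in the abstract, aligning recipe steps to a speech transcript with an HMM, can be illustrated with a minimal sketch. This is not the authors' model: it assumes a simplified left-to-right HMM in which each transcript segment either stays on the current recipe step or advances to the next one, and it scores emissions by smoothed word overlap. The function names (`tokenize`, `align_steps`) and the parameters `p_stay` and `smoothing` are hypothetical, chosen only for illustration.

```python
import math


def tokenize(text):
    """Return a lowercase bag of words with basic punctuation stripped."""
    return {word.strip(".,;:!?") for word in text.lower().split()}


def align_steps(steps, segments, p_stay=0.5, smoothing=0.1):
    """Viterbi decoding for a toy left-to-right HMM (illustrative assumption).

    States are recipe steps, observations are transcript segments.
    Returns one step index per segment, non-decreasing over time.
    """
    n_steps, n_segs = len(steps), len(segments)
    if n_steps == 0 or n_segs == 0:
        return []

    step_words = [tokenize(s) for s in steps]
    seg_words = [tokenize(s) for s in segments]
    log_stay = math.log(p_stay)
    log_advance = math.log(1.0 - p_stay)
    neg_inf = float("-inf")

    def emit(k, t):
        # Smoothed word overlap stands in for a real emission model.
        return math.log(len(step_words[k] & seg_words[t]) + smoothing)

    score = [[neg_inf] * n_steps for _ in range(n_segs)]
    back = [[0] * n_steps for _ in range(n_segs)]
    score[0][0] = emit(0, 0)  # assume the video starts at the first step

    for t in range(1, n_segs):
        for k in range(n_steps):
            stay = score[t - 1][k] + log_stay
            advance = score[t - 1][k - 1] + log_advance if k > 0 else neg_inf
            best, prev = (advance, k - 1) if advance > stay else (stay, k)
            if best == neg_inf:
                continue  # state k is not reachable by time t
            score[t][k] = best + emit(k, t)
            back[t][k] = prev

    # Trace back from the best-scoring final state.
    k = max(range(n_steps), key=lambda j: score[-1][j])
    path = [k]
    for t in range(n_segs - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return path


steps = ["Chop the onions", "Fry the onions in butter", "Add the rice and stir"]
segments = [
    "first we chop some onions",
    "chop until they are fine",
    "now fry them in hot butter",
    "then add rice and stir well",
]
print(align_steps(steps, segments))  # [0, 0, 1, 2]
```

Unlike plain keyword spotting, which would score each segment independently, the transition structure here forces the decoded steps to follow recipe order. The visual refinement stage from the abstract is not covered by this sketch.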

Code & Models

Repositories

malmaud/whats_cookin (official)
