TL;DR
This paper introduces a method for aligning cooking recipe instructions with video content by combining speech transcripts and visual food detection, enabling applications like automatic recipe illustration and event search.
Contribution
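It aligns recipe steps to the automatically generated speech transcript with an HMM, then refines that alignment with a visual food detector based on a deep convolutional neural network. The combined approach outperforms simpler keyword-spotting techniques on cooking videos.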
Findings
Outperforms keyword spotting techniques
Enables automatic recipe illustration with keyframes
Facilitates searching for specific events in videos
Abstract
We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
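The transcript-alignment step described in the abstract can be illustrated with a small sketch. The abstract does not specify the HMM's emission or transition model, so everything below is an assumption made for illustration: the states are recipe steps, the observations are transcript segments, the emission score is a smoothed word-overlap count, and the only transitions allowed are staying on a step or advancing to the next one. The names `emission_logprob` and `align_steps` are hypothetical, and the visual refinement stage is not covered.

```python
import math


def emission_logprob(step_words, segment_words, smoothing=0.1):
    """Hypothetical emission score: log of smoothed word overlap.

    This bag-of-words score is an assumption for illustration,
    not the emission model used in the paper.
    """
    overlap = len(set(step_words) & set(segment_words))
    return math.log(overlap + smoothing)


def align_steps(steps, segments,
                stay_logprob=math.log(0.5),
                advance_logprob=math.log(0.5)):
    """Viterbi decoding over a left-to-right HMM.

    Returns, for each transcript segment, the index of the recipe
    step it is assigned to. The path starts at step 0 and can only
    stay on the current step or move to the next one.
    """
    steps_tok = [s.lower().split() for s in steps]
    segs_tok = [s.lower().split() for s in segments]
    n_steps, n_segs = len(steps_tok), len(segs_tok)
    neg_inf = float("-inf")

    # score[t][k]: best log-score of any path ending in step k at segment t.
    score = [[neg_inf] * n_steps for _ in range(n_segs)]
    back = [[0] * n_steps for _ in range(n_segs)]
    score[0][0] = emission_logprob(steps_tok[0], segs_tok[0])

    for t in range(1, n_segs):
        for k in range(n_steps):
            stay = score[t - 1][k] + stay_logprob
            advance = score[t - 1][k - 1] + advance_logprob if k > 0 else neg_inf
            if advance > stay:
                best, prev = advance, k - 1
            else:
                best, prev = stay, k
            if best == neg_inf:
                continue  # step k is not reachable by segment t
            score[t][k] = best + emission_logprob(steps_tok[k], segs_tok[t])
            back[t][k] = prev

    # Backtrace from the best-scoring final state.
    k = max(range(n_steps), key=lambda j: score[-1][j])
    path = [k]
    for t in range(n_segs - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return path


# Toy recipe and transcript, invented for this example.
steps = [
    "chop the onions",
    "fry the onions in oil",
    "add tomatoes and simmer",
]
segments = [
    "first we chop the onions finely",
    "now heat oil and fry the onions",
    "keep stirring while they fry",
    "then add the tomatoes",
    "let it simmer for ten minutes",
]
path = align_steps(steps, segments)
```

On this toy input the decoded path assigns the first segment to step 0, the next two to step 1, and the last two to step 2. A keyword-spotting baseline would score each segment independently; the monotonic transition structure is what lets a weakly matching segment such as "keep stirring while they fry" inherit the step of its neighbours.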
