Recipe Generation from Unsegmented Cooking Videos
Taichi Nishimura, Atsushi Hashimoto, Yoshitaka Ushiku, Hirotaka Kameko, Shinsuke Mori

TL;DR
This paper introduces a transformer-based multimodal approach to recipe generation from unsegmented cooking videos. The approach selects key events and generates accurate, story-aware recipes, and it outperforms existing dense video captioning models.
Contribution
The paper proposes an event selection and sentence generation method that accounts for the recipe story and the ingredients, improving accuracy over prior dense video captioning models.
Findings
Outperforms state-of-the-art dense video captioning models
Produces recipes with correct event order and appropriate number of steps
Incorporates ingredients for more accurate recipe generation
Abstract
This paper tackles recipe generation from unsegmented cooking videos, a task that requires agents to (1) extract key events in completing the dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims at detecting events thoroughly and generating sentences for them. However, unlike DVC, in recipe generation, recipe story awareness is crucial, and a model should extract an appropriate number of events in the correct order and generate accurate sentences based on them. We analyze the output of the DVC model and confirm that although (1) several events are adoptable as a recipe story, (2) the generated sentences for such events are not grounded in the visual content. Based on this, we set our goal to obtain correct recipes by selecting oracle events from the output events and re-generating sentences for them. To achieve this,…
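The event-selection step described above (choosing an appropriate number of events, in the correct order, from the many candidates a DVC model emits) can be illustrated with a toy sketch. This is not the paper's transformer-based selector: it swaps in a classic weighted-interval-scheduling dynamic program, and the `Event` class, the `score` field, and the `select_events` function are all hypothetical names introduced only for illustration. It assumes each candidate event already carries a relevance score and simply picks the highest-scoring set of non-overlapping events, returned in temporal order.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A candidate event segment (hypothetical structure, not from the paper)."""
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    score: float  # assumed relevance score for the recipe story


def select_events(candidates):
    """Pick non-overlapping events with maximum total score, in time order.

    Stand-in for a learned selector: plain weighted interval scheduling.
    """
    events = sorted(candidates, key=lambda e: e.end)
    ends = [e.end for e in events]
    best = [0.0] * (len(events) + 1)  # best[i]: best total over first i events
    take = [False] * len(events)      # take[i]: event i is used in best[i + 1]
    prev = [0] * len(events)          # prev[i]: count of events ending by events[i].start

    for i, e in enumerate(events):
        # Earlier events that end at or before this one starts are compatible.
        prev[i] = bisect_right(ends, e.start, 0, i)
        with_e = best[prev[i]] + e.score
        if with_e > best[i]:
            best[i + 1] = with_e
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Walk back through the table to recover the chosen events.
    chosen = []
    i = len(events)
    while i > 0:
        if take[i - 1]:
            chosen.append(events[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return chosen


if __name__ == "__main__":
    candidates = [
        Event(0, 10, 0.9),
        Event(5, 15, 0.4),
        Event(10, 20, 0.8),
        Event(18, 30, 0.3),
        Event(20, 30, 0.7),
    ]
    for event in select_events(candidates):
        print(event)
```

In this sketch the selected events would then be passed to a separate sentence generator, mirroring the abstract's two-stage framing of selecting events and re-generating sentences for them.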
Taxonomy
Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques
