Recipe Generation from Unsegmented Cooking Videos

Taichi Nishimura; Atsushi Hashimoto; Yoshitaka Ushiku; Hirotaka Kameko; Shinsuke Mori

arXiv:2209.10134·cs.MM·February 20, 2024

TL;DR

This paper introduces a transformer-based multimodal approach for recipe generation from unsegmented cooking videos, focusing on selecting key events and generating accurate, story-aware recipes, outperforming existing dense video captioning models.

Contribution

The paper proposes an event selection and sentence generation method that incorporates recipe story awareness and ingredients, improving accuracy over prior dense video captioning models.

Findings

1. Outperforms state-of-the-art dense video captioning models

2. Produces recipes with the correct event order and an appropriate number of steps

3. Incorporates ingredients for more accurate recipe generation

Abstract

This paper tackles recipe generation from unsegmented cooking videos, a task that requires agents to (1) extract key events in completing the dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims at detecting events thoroughly and generating sentences for them. However, unlike DVC, in recipe generation, recipe story awareness is crucial, and a model should extract an appropriate number of events in the correct order and generate accurate sentences based on them. We analyze the output of the DVC model and confirm that although (1) several events are adoptable as a recipe story, (2) the generated sentences for such events are not grounded in the visual content. Based on this, we set our goal to obtain correct recipes by selecting oracle events from the output events and re-generating sentences for them. To achieve this,…
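The event selection idea in the abstract can be illustrated with a minimal sketch. It assumes that an "oracle" event is the candidate segment with the highest temporal IoU against each reference recipe step; the abstract does not confirm this definition, and the names `temporal_iou` and `select_oracle_events` are hypothetical, not the authors' code.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec). This representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_oracle_events(
    candidates: List[Segment], references: List[Segment]
) -> List[Segment]:
    """For each reference step, keep the best-overlapping candidate.

    Assumed selection rule (not taken from the paper): highest temporal IoU.
    Duplicates are dropped and the result is returned in chronological order.
    """
    chosen: List[Segment] = []
    for ref in references:
        best = max(candidates, key=lambda c: temporal_iou(c, ref))
        if best not in chosen:
            chosen.append(best)
    return sorted(chosen)


# Hypothetical candidate events from a DVC model and two reference steps.
candidates = [(0.0, 10.0), (5.0, 20.0), (18.0, 40.0), (35.0, 60.0)]
references = [(0.0, 9.0), (20.0, 41.0)]
oracle = select_oracle_events(candidates, references)
```

In this toy input the sketch keeps two of the four candidates, which mirrors the goal of extracting an appropriate number of events in the correct order before sentences are re-generated for them.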

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques