RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation

Ruoxuan Zhang; Jidong Gao; Bin Wen; Hongxia Xie; Chenming Zhang; Hong-Han Shuai; Wen-Huang Cheng

arXiv:2506.06733·cs.CV·June 12, 2025

Open Access

TL;DR

RecipeGen introduces a comprehensive, large-scale benchmark dataset for multimodal recipe generation tasks, addressing the lack of fine-grained alignment between recipe instructions, images, and videos in food computing.

Contribution

RecipeGen provides the first real-world multimodal benchmark dataset with domain-specific evaluation metrics for recipe-based text-to-image, image-to-video, and text-to-video generation.

Findings

01. Benchmark results for T2I, I2V, and T2V models

02. Domain-specific metrics for ingredient fidelity

03. Insights for future recipe generation models

Abstract

Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment between recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. The project page is available now.

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Multisensory perception and integration