RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation
Ruoxuan Zhang, Jidong Gao, Bin Wen, Hongxia Xie, Chenming Zhang, Hong-Han Shuai, Wen-Huang Cheng

TL;DR
RecipeGen introduces a comprehensive, large-scale benchmark dataset for multimodal recipe generation tasks, addressing the lack of fine-grained alignment between recipe instructions, images, and videos in food computing.
Contribution
It provides the first large-scale, real-world multimodal benchmark dataset, with domain-specific evaluation metrics, for recipe-based text-to-image, image-to-video, and text-to-video generation.
Findings
Benchmark results for representative T2I, I2V, and T2V models
Domain-specific metrics for ingredient fidelity and interaction modeling
Insights for future recipe generation models
Abstract
Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment between recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. Project page is available now.
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Multisensory perception and integration
