CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation

Ruoxuan Zhang; Bin Wen; Hongxia Xie; Yi Yao; Songhan Zuo; Jian-Yu Jiang-Lin; Hong-Han Shuai; Wen-Huang Cheng

arXiv:2512.03540·cs.CV·December 8, 2025

TL;DR

CookAnything is a diffusion-based framework that generates coherent, multi-step recipe images from arbitrary-length textual instructions, addressing limitations of previous models in handling structured, variable-length procedures.

Contribution

The paper introduces three novel components, Step-wise Regional Control (SRC), Flexible RoPE, and CSCC, that together enable flexible, consistent, and semantically aligned multi-step recipe image generation.

Findings

01. Outperforms existing methods on recipe illustration benchmarks.

02. Supports scalable, high-quality image synthesis for complex instructions.

03. Maintains ingredient consistency across steps.

Abstract

Cooking is a sequential and visually grounded activity, where each step, such as chopping, mixing, or frying, carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods cannot adjust to the natural variability in recipe length: they generate a fixed number of images regardless of the actual structure of the instructions. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Computer Graphics and Visualization Techniques