MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
Yiren Song, Cheng Liu, Mike Zheng Shou

TL;DR
MakeAnything leverages diffusion transformers and a new dataset to generate multi-step procedural sequences across various domains, addressing data scarcity, consistency, and generalization challenges.
Contribution
The paper introduces MakeAnything, a diffusion transformer-based framework that uses asymmetric low-rank adaptation (LoRA) and supports image-to-process generation, targeting multi-domain procedural sequence generation.
Findings
Outperforms existing methods on procedural generation benchmarks
Creates consistent multi-step sequences across 21 diverse tasks
Enables image-to-process generation with spatiotemporal coherence
Abstract
A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DIT), which leverages fine-tuning to activate the in-context capabilities of DIT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by…
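For readers unfamiliar with low-rank adaptation, the following is a minimal sketch of standard LoRA only. The abstract is cut off before it explains how the asymmetric variant works, so nothing here should be read as the authors' method. The function names (`matmul`, `transpose`, `lora_forward`), the dimensions, and the `alpha` scaling factor are illustrative assumptions; the idea shown is simply that a frozen weight matrix W is augmented with a trainable low-rank product B·A.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def lora_forward(x, W, A, B, alpha=1.0):
    """Compute x @ W^T + (alpha / r) * x @ A^T @ B^T.

    x: (batch, d_in), W: (d_out, d_in) frozen,
    A: (r, d_in) and B: (d_out, r) are the trainable low-rank factors.
    All shapes and names are hypothetical, chosen for illustration.
    """
    r = len(A)
    scale = alpha / r
    base = matmul(x, transpose(W))
    low = matmul(matmul(x, transpose(A)), transpose(B))
    return [[b + scale * l for b, l in zip(brow, lrow)]
            for brow, lrow in zip(base, low)]

random.seed(0)
d_in, d_out, r, batch = 8, 6, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(batch)]

y = lora_forward(x, W, A, B)

# Trainable parameters in the adapter versus the full weight matrix.
lora_params = r * (d_in + d_out)
full_params = d_in * d_out
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small factors would be updated during fine-tuning.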
Taxonomy
Topics: Speech Recognition and Synthesis
