MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation

Yiren Song; Cheng Liu; Mike Zheng Shou

arXiv:2502.01572 · cs.CV · February 6, 2025

TL;DR

MakeAnything leverages diffusion transformers and a new dataset to generate multi-step procedural sequences across various domains, addressing data scarcity, consistency, and generalization challenges.

Contribution

The paper introduces MakeAnything, a diffusion transformer-based framework with novel LoRA adaptation and image-to-process capabilities for multi-domain procedural sequence generation.

Findings

1. Outperforms existing methods on procedural generation benchmarks
2. Creates consistent multi-step sequences across 21 diverse tasks
3. Enables image-to-process generation with spatiotemporal coherence

Abstract

A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DIT), which leverages fine-tuning to activate the in-context capabilities of DIT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by…

Taxonomy

Topics: Speech Recognition and Synthesis