MAD: Motion Appearance Decoupling for efficient Driving World Models
Ahmad Rahimi, Valentin Gerard, Eloi Zablocki, Matthieu Cord, Alexandre Alahi

TL;DR
This paper introduces MAD, a two-stage framework that decouples motion learning from appearance synthesis to efficiently adapt generalist video diffusion models into controllable driving world models with minimal supervision.
Contribution
It presents a novel decoupling approach that separates motion prediction from appearance rendering, enabling efficient adaptation of video models for autonomous driving tasks.
Findings
Achieves performance comparable to state-of-the-art models with less than 6% of their compute.
When scaled to the LTX video model, outperforms open-source competitors.
Supports diverse controls including text, ego, and object manipulation.
Abstract
Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion…
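The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `TwoStageWorldModel`, `toy_motion_model`, and `toy_appearance_model` are hypothetical names, frames are plain nested lists, and the two toy functions only stand in for the adapted video diffusion backbone (which, per the abstract, is the same model reused in both stages). The sketch shows the data flow only: motion is first predicted in a simplified skeleton space, then rendered into appearance conditioned on that motion.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frame = List[List[float]]  # toy stand-in for one video frame (H x W grid)


def toy_motion_model(skeleton: Frame, horizon: int) -> List[Frame]:
    """Stage 1 stand-in (assumption): shift the skeleton one column right per step."""
    frames = []
    current = skeleton
    for _ in range(horizon):
        # Cyclic shift of every row by one column.
        current = [row[-1:] + row[:-1] for row in current]
        frames.append(current)
    return frames


def toy_appearance_model(motion: List[Frame], rgb_reference: Frame) -> List[Frame]:
    """Stage 2 stand-in (assumption): paint the reference frame's mean
    intensity wherever the predicted skeleton is active."""
    flat = [v for row in rgb_reference for v in row]
    tone = sum(flat) / len(flat)
    return [[[tone if v > 0 else 0.0 for v in row] for row in frame]
            for frame in motion]


@dataclass
class TwoStageWorldModel:
    """Hypothetical wrapper: motion prediction first, appearance synthesis second."""
    motion_model: Callable[[Frame, int], List[Frame]]
    appearance_model: Callable[[List[Frame], Frame], List[Frame]]

    def rollout(self, skeleton: Frame, rgb_reference: Frame,
                horizon: int) -> Tuple[List[Frame], List[Frame]]:
        # Stage 1: predict future motion in the simplified skeleton space.
        motion = self.motion_model(skeleton, horizon)
        # Stage 2: synthesize appearance conditioned on the predicted motion.
        video = self.appearance_model(motion, rgb_reference)
        return motion, video


skeleton = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
rgb_reference = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
model = TwoStageWorldModel(toy_motion_model, toy_appearance_model)
motion, video = model.rollout(skeleton, rgb_reference, horizon=2)
```

Because the appearance stage only sees the motion frames and a reference frame, the intermediate skeleton video could in principle be edited before rendering; the toy functions here make no claim about how the paper implements such control.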
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Human Motion and Animation
