The Design Space of Tri-Modal Masked Diffusion Models

Louis Bethune; Victor Turrisi; Bruno Kacper Mlodozeniec; Pau Rodriguez Lopez; Lokesh Boominathan; Nikhil Bhendawade; Amitis Shidani; Joris Pelemans; Theo X. Olausson; Devon Hjelm; Paul Dixon; Joao Monteiro; Pierre Ablin; Vishnu Banna; Arno Blaas; Nick Henderson; Kari Noriy; Dan Busbridge; Josh Susskind; Marco Cuturi; Irina Belousova; Luca Zappella; Russ Webb; Jason Ramapuram

arXiv:2602.21472 · cs.LG · February 26, 2026

Open Access

TL;DR

This paper introduces the first tri-modal masked diffusion model trained from scratch on text, image-text, and audio-text data, analyzing its scaling laws, optimizing inference, and demonstrating strong multimodal generation capabilities.

Contribution

The paper presents a tri-modal masked diffusion model, an SDE-based stochastic reparameterization that removes the need to tune the batch size, and an analysis of multimodal scaling behavior.

Findings

1. Pretrained a 3B-parameter tri-modal model on 6.4T tokens.

2. Achieved strong results in text, image, and speech tasks.

3. Provided insights into multimodal scaling laws and optimization techniques.

Abstract

Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing and fine-tuning a base unimodal model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the need for tuning the optimal batch size as reported in recent work. This reparameterization decouples the physical batch size, often chosen based on compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis · Speech and Audio Processing