Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

Hila Chefer; Patrick Esser; Dominik Lorenz; Dustin Podell; Vikash Raja; Vinh Tong; Antonio Torralba; Robin Rombach

arXiv:2603.06507·cs.CV·March 9, 2026

TL;DR

This paper introduces Self-Flow, a self-supervised approach that enhances semantic representations in generative models across multiple modalities without external supervision, improving scalability and quality.

Contribution

It proposes a novel self-supervised flow matching method with Dual-Timestep Scheduling, enabling representation learning within generative models across different data types.

Findings

01. Achieves superior image, video, and audio generation results.

02. Enables multi-modal training following expected scaling laws.

03. Eliminates the need for external models for semantic representation learning.

Abstract

Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method…
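
The abstract describes Dual-Timestep Scheduling only at a high level: tokens in one sample receive different noise levels, so cleaner tokens carry information that the model must use to infer the noisier ones. The sketch below illustrates that idea and should not be read as the paper's implementation. It assumes the common linear flow matching interpolation x_t = (1 - t) * x0 + t * eps with velocity target eps - x0. The function name `dual_timestep_corrupt`, the `mask_ratio` parameter, and the random split of tokens into two groups are hypothetical choices made for illustration.

```python
import random


def dual_timestep_corrupt(tokens, t_a, t_b, mask_ratio=0.5, rng=None):
    """Corrupt each token of one sample with one of two noise levels.

    tokens:     list of clean token vectors (lists of floats), i.e. x0.
    t_a, t_b:   noise levels in [0, 1]; 0 is clean, 1 is pure noise
                (assumed convention, not taken from the paper).
    mask_ratio: fraction of tokens that receive t_b (hypothetical parameter).

    Returns (noisy_tokens, per_token_t, velocity_targets).
    """
    rng = rng or random.Random()
    n = len(tokens)

    # Randomly choose which tokens get the second noise level.
    n_b = round(mask_ratio * n)
    idx_b = set(rng.sample(range(n), n_b))
    per_token_t = [t_b if i in idx_b else t_a for i in range(n)]

    noisy_tokens, velocity_targets = [], []
    for x0, t in zip(tokens, per_token_t):
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        # Linear interpolation between data and noise at this token's level.
        noisy_tokens.append([(1.0 - t) * a + t * e for a, e in zip(x0, eps)])
        # Velocity target for the assumed linear path: d x_t / d t = eps - x0.
        velocity_targets.append([e - a for a, e in zip(x0, eps)])

    return noisy_tokens, per_token_t, velocity_targets


if __name__ == "__main__":
    clean = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    noisy, levels, targets = dual_timestep_corrupt(
        clean, t_a=0.1, t_b=0.9, rng=random.Random(0)
    )
    for level, token in zip(levels, noisy):
        print(level, token)
```

In this sketch, a model would receive `noisy_tokens` together with `per_token_t` and regress `velocity_targets`. Lightly corrupted tokens then act as context for heavily corrupted ones, which is one way to realize the information asymmetry the abstract mentions.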

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Multimodal Machine Learning Applications