
TL;DR
This unofficial report reverse-engineers the FLUX.1 diffusion model from its source code to document its architecture, for which no official technical documentation exists, and to support its use as a backbone for future research. FLUX.1 is considered state-of-the-art in text-to-image generation.
Contribution
It provides a detailed, unofficial technical analysis of FLUX's architecture, obtained by reverse-engineering its source code, to facilitate the model's adoption in research.
Findings
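FLUX is considered state-of-the-art in text-to-image generation, reportedly outperforming Midjourney, DALL-E 3, Stable Diffusion 3 (SD3), and SDXL.
The report offers an unofficial, detailed overview of FLUX's architecture, derived directly from its source code.
It supports future research by clarifying the model's design despite the lack of official technical documentation.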
Abstract
FLUX.1 is a diffusion-based text-to-image generation model developed by Black Forest Labs, designed to achieve faithful text-image alignment while maintaining high image quality and diversity. FLUX is considered state-of-the-art in text-to-image generation, outperforming popular models such as Midjourney, DALL-E 3, Stable Diffusion 3 (SD3), and SDXL. Although the model is publicly available as open source, its authors have not released official technical documentation detailing its architecture or training setup. This report summarizes an extensive reverse-engineering effort aimed at demystifying FLUX's architecture directly from its source code, to support its adoption as a backbone for future research and development. This document is an unofficial technical report and is not published or endorsed by the original developers or their affiliated institutions.
Taxonomy
Topics: Digital Humanities and Scholarship · Generative Adversarial Networks and Image Synthesis · Mathematics, Computing, and Information Processing
