UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Tian Ye, Song Fei, Lei Zhu

TL;DR
UltraFlux takes a data-model co-design approach to native 4K text-to-image generation. It jointly addresses coupled failures in positional encoding, VAE compression, and optimization to produce high-fidelity, aesthetically strong images across diverse aspect ratios.
Contribution
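The paper presents UltraFlux, a Flux-based diffusion transformer trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-aspect-ratio coverage. The model combines Resonance 2D RoPE with YaRN for positional encoding, a non-adversarial VAE post-training scheme, and 4K-specific training strategies to enable stable native 4K image synthesis across diverse aspect ratios.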
Findings
Outperforms open-source baselines in fidelity and aesthetics.
Matches or surpasses proprietary Seedream 4.0 in quality.
Demonstrates stable, detail-preserving 4K generation across aspect ratios.
Abstract
Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware…
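For readers unfamiliar with the positional-encoding component named in the abstract, the following is a minimal sketch of plain axial 2D RoPE, the standard technique that such schemes build on. It is not the paper's Resonance 2D RoPE and does not include YaRN scaling; the function names (`rope_1d`, `rope_2d`, `dot`), the channel split (first half for rows, second half for columns), and the base of 10000 are illustrative assumptions, not details taken from the paper.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i//2 uses frequency base^(-i/dim), as in standard 1D RoPE.
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def rope_2d(vec, row, col, base=10000.0):
    """Axial 2D RoPE (assumed layout): first half of channels encodes the row,
    second half encodes the column. This is NOT the paper's Resonance variant."""
    assert len(vec) % 4 == 0, "channel count must be divisible by 4"
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))
```

The useful property of this construction is that the dot product between a rotated query and a rotated key depends only on their row and column offsets, not on their absolute positions. For example, `dot(rope_2d(q, 0, 0), rope_2d(k, 1, 2))` and `dot(rope_2d(q, 2, 3), rope_2d(k, 3, 5))` agree up to floating-point error, since both pairs are offset by one row and two columns. At 4K the token grid is far larger than at 1K, so many positions fall outside the range seen in lower-resolution training, which is the kind of extrapolation that frequency-rescaling methods such as YaRN are meant to handle.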
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications
