UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Tian Ye; Song Fei; Lei Zhu

arXiv:2511.18050 · cs.CV · November 25, 2025

TL;DR

UltraFlux introduces a data-model co-design approach to native 4K text-to-image generation. It addresses tightly coupled failure modes in positional encoding, VAE compression, and optimization, and produces high-quality outputs across diverse aspect ratios with improved fidelity and aesthetics.

Contribution

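The paper presents UltraFlux, a Flux-based diffusion transformer trained natively at 4K on the MultiAspect-4K-1M dataset. It combines Resonance 2D RoPE with YaRN for positional encoding, a post-trained VAE for better 4K reconstruction, and 4K-specific training strategies, enabling stable, high-quality native 4K image synthesis across diverse aspect ratios.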
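
Some arithmetic shows why native 4K across many aspect ratios is demanding. The Python sketch below picks, for each aspect ratio, an image size close to a 4096×4096 pixel budget and counts the resulting transformer tokens. It assumes Flux-style compression (8x VAE downsampling followed by 2×2 patch packing, so 16 pixels per token side). The compression factors UltraFlux actually uses, the list of aspect ratios, and the helper names `bucket_for_aspect_ratio` and `token_grid` are illustrative assumptions, not taken from the paper.

```python
import math

# Assumed (Flux-style) compression, for illustration only: 8x VAE spatial
# downsampling followed by 2x2 patch packing, i.e. 16 pixels per token side.
VAE_DOWNSAMPLE = 8
PATCH = 2
ALIGN = VAE_DOWNSAMPLE * PATCH


def bucket_for_aspect_ratio(ar, pixel_budget=4096 * 4096):
    """Hypothetical helper: (width, height) with width/height close to `ar`
    and area close to `pixel_budget`, both rounded to multiples of ALIGN."""
    height = math.sqrt(pixel_budget / ar)
    width = height * ar
    w = max(ALIGN, round(width / ALIGN) * ALIGN)
    h = max(ALIGN, round(height / ALIGN) * ALIGN)
    return w, h


def token_grid(width, height):
    """Hypothetical helper: number of transformer tokens along (width, height)."""
    return width // ALIGN, height // ALIGN


for ar in (1.0, 4 / 3, 16 / 9, 9 / 16, 21 / 9):
    w, h = bucket_for_aspect_ratio(ar)
    gw, gh = token_grid(w, h)
    print(f"AR {ar:.2f}: {w}x{h} px -> {gw}x{gh} = {gw * gh} tokens")
```

Under these assumptions, a 1024×1024 image maps to a 64×64 = 4,096-token grid, while 4096×4096 maps to 256×256 = 65,536 tokens. That is 16x the sequence length and roughly 256x the cost of full self-attention, and non-square buckets stretch one axis well beyond the 1K range.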

Findings

01

Outperforms open-source baselines in fidelity and aesthetics.

02

Matches or surpasses the proprietary Seedream 4.0 in quality.

03

Demonstrates stable, detail-preserving 4K generation across aspect ratios.

Abstract

Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware…
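
A generic sketch can illustrate the positional-encoding ingredient. The Python code below applies plain YaRN ("NTK-by-parts") frequency interpolation separately to the width and height axes of a 2D RoPE, and adds YaRN's attention-temperature factor. It does not reproduce the paper's Resonance 2D RoPE. The function names, the RoPE base of 10000, the ramp bounds α = 1 and β = 32 (YaRN's defaults for language models), and the 64×64-token training grid are all assumptions for illustration.

```python
import math

# Hypothetical helpers: generic per-axis YaRN on a 2D RoPE, NOT the paper's
# Resonance 2D RoPE. Base, alpha, beta and grid sizes are assumed values.


def rope_inv_freqs(dim, base=10000.0):
    """Standard RoPE angular frequencies theta_i = base^(-2i/dim), i < dim/2."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def yarn_scale_freqs(freqs, train_len, target_len, alpha=1.0, beta=32.0):
    """YaRN 'NTK-by-parts' interpolation along one axis.

    With s = target_len / train_len, frequencies completing fewer than `alpha`
    rotations over the training window are divided by s (full interpolation),
    those completing more than `beta` rotations are kept unchanged, and the
    ones in between are blended linearly.
    """
    s = max(1.0, target_len / train_len)
    out = []
    for theta in freqs:
        rotations = train_len * theta / (2 * math.pi)  # train_len / wavelength
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append((1 - gamma) * theta / s + gamma * theta)
    return out


def yarn_attn_scale(s):
    """YaRN temperature factor sqrt(1/t) = 0.1 * ln(s) + 1; YaRN multiplies
    both queries and keys by it."""
    return 0.1 * math.log(s) + 1.0 if s > 1 else 1.0


def rope_2d_angles(x, y, head_dim, train_hw, target_hw):
    """Rotation angles for the token at grid position (x, y).

    Half of the head dimension encodes x (width), the other half encodes y
    (height), and each axis gets its own YaRN scale factor, so non-square
    targets are stretched differently along width and height.
    """
    base_freqs = rope_inv_freqs(head_dim // 2)
    (train_h, train_w), (tgt_h, tgt_w) = train_hw, target_hw
    fx = yarn_scale_freqs(base_freqs, train_w, tgt_w)
    fy = yarn_scale_freqs(base_freqs, train_h, tgt_h)
    return [x * f for f in fx] + [y * f for f in fy]


def apply_rotary(vec, angles):
    """Rotate consecutive pairs (vec[2k], vec[2k+1]) by angles[k]."""
    out = []
    for k, a in enumerate(angles):
        u, v = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(a), math.sin(a)
        out.extend([u * c - v * s, u * s + v * c])
    return out
```

For example, take a 16:9 4K target of 341×192 tokens (under a 16-pixels-per-token assumption) and a model trained on a 64×64 grid. The width axis is stretched by about 5.3x and the height axis by 3x, so the two halves of each attention head receive different interpolation. This per-axis treatment is one simple way to make the encoding aspect-ratio-aware.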

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications