Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Dong Chen; Fangyun Wei; Ziyu Wan; Dongdong Chen; Jiawei Zhang; Jinjing Zhao; Sirui Zhang; Yang Yue; Zhiyang Liang; Baining Guo; Chong Luo; Jianmin Bao; Ji Li; Lei Shi; Qinhong Yang; Xiuyu Wu; Xuelu Feng; Yan Lu; Yanchen Dong; Yitong Wang; Yunuo Chen

arXiv:2605.21573·cs.CV·May 22, 2026

TL;DR

Lens is a compact 3.8B-parameter text-to-image model that performs competitively with larger models while using significantly less training compute, in part by training on densely captioned data and on batches that mix resolutions and aspect ratios.

Contribution

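The paper introduces Lens, a 3.8B-parameter T2I model that is competitive with, and in several cases surpasses, state-of-the-art models with more than 6B parameters across various benchmarks while requiring significantly less training compute. It attributes this efficiency to its compact size together with its training strategies and architectural choices.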

Findings

01 Lens requires only about 19.3% of the training compute used by Z-Image.

02 Lens generalizes across various aspect ratios and resolutions up to 1440×1440.

03 The turbo version generates images in 0.84 seconds on a single GPU.

Abstract

We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each…
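The abstract says each batch is built from images with multiple resolutions and diverse aspect ratios, but the visible text does not say how. As an illustration only, here is a minimal sketch of one generic way to do this: snap each image to the nearest shape bucket, then fill a batch under a token budget so that several shapes can share it. The bucket list, the patch size, and the function names `nearest_bucket`, `tokens`, and `build_mixed_batch` are all assumptions made for this example and are not taken from the paper or the Lens repository.

```python
import math
import random

# Hypothetical (height, width) buckets; the paper's actual shapes are not given here.
BUCKETS = [(512, 512), (768, 768), (1024, 1024), (768, 1152), (1152, 768), (1440, 1440)]
PATCH = 16  # assumed patch size in pixels, for illustration only


def nearest_bucket(height, width):
    """Return the bucket closest in aspect ratio, breaking ties by pixel area."""
    def cost(bucket):
        bh, bw = bucket
        ratio_gap = abs(math.log((width / height) / (bw / bh)))
        area_gap = abs(math.log((width * height) / (bw * bh)))
        return (ratio_gap, area_gap)
    return min(BUCKETS, key=cost)


def tokens(bucket):
    """Number of patch tokens an image occupies once resized to this bucket."""
    bh, bw = bucket
    return (bh // PATCH) * (bw // PATCH)


def build_mixed_batch(image_sizes, token_budget, seed=0):
    """Greedily fill one batch with images of different shapes under a token budget."""
    rng = random.Random(seed)
    order = list(range(len(image_sizes)))
    rng.shuffle(order)
    batch, used = [], 0
    for idx in order:
        h, w = image_sizes[idx]
        bucket = nearest_bucket(h, w)
        need = tokens(bucket)
        if used + need <= token_budget:
            batch.append((idx, bucket))
            used += need
    return batch, used


if __name__ == "__main__":
    sizes = [(1000, 1000), (800, 1200), (500, 500)]
    batch, used = build_mixed_batch(sizes, token_budget=8000)
    print(batch, used)
```

Budgeting by tokens rather than by image count is a design choice of this sketch: it keeps the cost of a batch roughly constant even when the batch holds a mix of small and large images.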

Code & Models

Repositories

microsoft/Lens (GitHub)
