Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
Dong Chen, Fangyun Wei, Ziyu Wan, Dongdong Chen, Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, Zhiyang Liang, Baining Guo, Chong Luo, Jianmin Bao, Ji Li, Lei Shi, Qinhong Yang, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yitong Wang, Yunuo Chen

TL;DR
Lens is a compact 3.8B-parameter text-to-image model that reaches performance competitive with larger models at a fraction of their training compute, relying on densely captioned data, architectural choices, and systematic training optimizations.
Contribution
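The paper introduces Lens, a 3.8B-parameter T2I model that is competitive with, and in several cases surpasses, models with more than 6B parameters while requiring significantly less training compute, a result it attributes to its training strategies and architectural choices.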
Findings
Lens requires only about 19.3% of the training compute used by Z-Image.
Lens generalizes across various aspect ratios and resolutions up to 1440×1440.
The turbo version generates images in 0.84 seconds on a single GPU.
Abstract
We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each…
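The abstract says each batch is built from images with multiple resolutions and diverse aspect ratios, but the visible text does not say how. As a minimal sketch of one common way to do this, the Python below greedily packs samples of different sizes into a batch under a fixed token budget. The patch size, the token budget, and the names `Sample`, `token_count`, and `pack_mixed_batches` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Image dimensions in pixels (hypothetical container, not from the paper)."""
    height: int
    width: int


def token_count(sample: Sample, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens; the patch size is an assumption."""
    return (sample.height // patch) * (sample.width // patch)


def pack_mixed_batches(
    samples: List[Sample], token_budget: int, patch: int = 16
) -> List[List[Sample]]:
    """Greedily group samples of any resolution so each batch fits the token budget."""
    batches: List[List[Sample]] = []
    current: List[Sample] = []
    used = 0
    for sample in samples:
        n = token_count(sample, patch)
        if n > token_budget:
            raise ValueError("a single sample exceeds the token budget")
        # Close the current batch when the next sample would overflow it.
        if current and used + n > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(sample)
        used += n
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    data = [Sample(512, 512), Sample(1024, 768), Sample(768, 1024), Sample(1440, 1440)]
    for batch in pack_mixed_batches(data, token_budget=8192):
        print([(s.height, s.width) for s in batch])
```

Under this scheme a single batch can hold square, portrait, and landscape images at once, which is the property the abstract describes. The actual batching procedure used for Lens may differ.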
Code & Models
- 🤗 microsoft/Lens (model): 1.3k downloads, 144 likes
- 🤗 microsoft/Lens-Turbo (model): 1.7k downloads, 128 likes
- 🤗 microsoft/Lens-Base (model): 481 downloads, 17 likes
- 🤗 vantagewithai/Lens-Turbo-GGUF-ComfyUI (model): 1.1k downloads, 2 likes
- 🤗 ngoctham/Lens (model): 14 downloads, 1 like
- 🤗 YuCollection/Lens-Base-Diffusers (model): 10 downloads
- 🤗 YuCollection/Lens-Diffusers (model): 14 downloads
- 🤗 YuCollection/Lens-Turbo-Diffusers (model): 14 downloads
- 🤗 vantagewithai/Lens-GGUF-ComfyUI (model): 850 downloads
- 🤗 Langitzt/Lens-Turbo (model): 16 downloads
