LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation

Jiahao Wang; Ning Kang; Lewei Yao; Mengzhao Chen; Chengyue Wu; Songyang Zhang; Shuchen Xue; Yong Liu; Taiqiang Wu; Xihui Liu; Kaipeng Zhang; Shifeng Zhang; Wenqi Shao; Zhenguo Li; Ping Luo

arXiv:2501.12976 · cs.CV · September 29, 2025

TL;DR

This paper introduces LiT, a simple linear diffusion transformer for image generation that is efficient, easy to adapt from pre-trained models, and achieves performance comparable to more complex methods.

Contribution

The paper presents practical guidelines for converting pre-trained Diffusion Transformers into linear models, enabling efficient image generation with minimal training.
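One of the guidelines listed in the abstract is to load all pre-trained parameters except those related to linear attention. The snippet below is a minimal sketch of that idea using plain dictionaries in place of real checkpoints. The function name `inherit_weights`, the `"attn"` substring filter, and the parameter names are all hypothetical and are not taken from the paper's code.

```python
def inherit_weights(teacher_state, student_state, skip_substring="attn"):
    """Copy teacher parameters into the student, skipping attention-related ones.

    Assumption: attention parameters can be identified by a substring in
    their name. Real checkpoints may use different naming.
    """
    merged = dict(student_state)  # start from the student's own initial values
    loaded, skipped = [], []
    for name, value in teacher_state.items():
        # Skip attention parameters and anything the student does not have.
        if skip_substring in name or name not in student_state:
            skipped.append(name)
            continue
        merged[name] = value
        loaded.append(name)
    return merged, loaded, skipped


# Toy "checkpoints": integers stand in for weight tensors.
teacher = {
    "blocks.0.attn.qkv.weight": 1,
    "blocks.0.mlp.fc1.weight": 2,
    "final_layer.weight": 3,
}
student = {
    "blocks.0.attn.qkv.weight": 10,
    "blocks.0.attn.dwc.weight": 11,  # hypothetical depth-wise conv parameter
    "blocks.0.mlp.fc1.weight": 12,
    "final_layer.weight": 13,
}
merged, loaded, skipped = inherit_weights(teacher, student)
```

In this toy example the MLP and final-layer entries come from the teacher, while both attention entries keep the student's own values.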

Findings

1) LiT achieves comparable performance to DiT with significantly less training.

2) Fewer attention heads improve performance without increasing latency.

3) LiT can be adapted quickly for class-conditional and text-to-image generation.

Abstract

In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, motivated by its simplicity, parallelism, and efficiency for image generation. Through detailed exploration, we offer a suite of ready-to-use solutions, ranging from linear attention design to optimization strategies. Our core contributions include 5 practical guidelines: 1) Applying depth-wise convolution within simple linear attention is sufficient for image generation. 2) Using fewer heads in linear attention provides a free-lunch performance boost without increasing latency. 3) Inheriting weights from a fully converged, pre-trained DiT. 4) Loading all parameters except those related to linear attention. 5) Hybrid knowledge distillation: using a pre-trained teacher DiT to help the training of the student linear DiT, supervising not only the predicted noise but also the variance of the…
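As background for the term "linear attention", here is a minimal single-head sketch in NumPy. It is not LiT's exact module: the ReLU feature map is an assumption made for illustration, and the depth-wise convolution and multi-head handling mentioned in the guidelines are omitted. The function names are hypothetical. The sketch only shows the generic trick of computing K^T V first, so the cost grows linearly with the number of tokens, and checks it against the equivalent quadratic form.

```python
import numpy as np


def relu_feature_map(x, eps=1e-6):
    # Assumed kernel feature map; eps keeps the normalizer strictly positive.
    return np.maximum(x, 0.0) + eps


def linear_attention(q, k, v):
    """Single-head linear attention. q, k: (n, d); v: (n, d_v)."""
    q, k = relu_feature_map(q), relu_feature_map(k)
    kv = k.T @ v            # (d, d_v): aggregated once, independent of n queries
    k_sum = k.sum(axis=0)   # (d,): used for row-wise normalization
    return (q @ kv) / (q @ k_sum)[:, None]


def quadratic_reference(q, k, v):
    """Same result computed through the explicit (n, n) attention matrix."""
    q, k = relu_feature_map(q), relu_feature_map(k)
    scores = q @ k.T
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 4))
out = linear_attention(q, k, v)
```

Both functions produce the same output up to floating-point error; the linear version never materializes the (n, n) matrix, which is where the efficiency gain comes from.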

Taxonomy

Topics: Image Processing Techniques and Applications · Advanced Optical Imaging Technologies · Advanced Vision and Imaging

Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer