LinFusion: 1 GPU, 1 Minute, 16K Image
Songhua Liu, Weihao Yu, Zhenxiong Tan, Xinchao Wang

TL;DR
LinFusion introduces a linear attention mechanism that enables high-resolution image generation, up to 16K resolution, on a single GPU after modest training, matching or surpassing the original quadratic-complexity Stable Diffusion models.
Contribution
The paper proposes a generalized linear attention paradigm, leveraging insights from existing models, to significantly reduce computational complexity while maintaining or improving image generation quality.
Findings
Achieves 16K image generation on a single GPU
Matches or surpasses Stable Diffusion performance after modest training
Highly compatible with existing SD components and pipelines
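To make the 16K claim concrete, the following back-of-the-envelope sketch estimates how large a single dense self-attention map would be at that resolution. It is illustrative arithmetic rather than a measurement from the paper, and it rests on several assumptions: a square 16384×16384 image, the 8× VAE downsampling used by Stable Diffusion v1.5, attention applied at full latent resolution, fp16 storage, and a hypothetical head dimension of 64. The function names are made up for this example.

```python
def attention_map_bytes(image_side, vae_downsample=8, bytes_per_elem=2):
    """Token count and size of ONE dense N x N attention map (one head, one layer).

    Assumes a square image and attention over every latent position.
    """
    latent_side = image_side // vae_downsample
    n_tokens = latent_side * latent_side
    return n_tokens, n_tokens * n_tokens * bytes_per_elem


def linear_state_bytes(head_dim=64, bytes_per_elem=2):
    """Size of the token-count-independent state of normalized linear attention.

    The state is a (head_dim x head_dim) key-value summary plus a head_dim
    normalizer vector; head_dim=64 is a hypothetical value for illustration.
    """
    return (head_dim * head_dim + head_dim) * bytes_per_elem


if __name__ == "__main__":
    for side in (512, 16384):
        n, quad = attention_map_bytes(side)
        print(f"{side}px: {n} tokens, dense map {quad / 2**20:.0f} MiB, "
              f"linear state {linear_state_bytes()} bytes")
```

Under these assumptions a 512-pixel image gives 4,096 tokens and a 32 MiB attention map, while a 16K image gives 4,194,304 tokens and a map of 2^45 bytes (32 TiB) per head. No single GPU can hold that, which is why materializing the full map is not an option at this resolution.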
Abstract
Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we aim at a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, and Gated Linear Attention, and identify two key features (attention normalization and non-causal inference) that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention…
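The two features named in the abstract, attention normalization and non-causal inference, can be illustrated with a minimal NumPy sketch of generic kernelized linear attention. This is not the LinFusion module itself, whose exact formulation is not given in the text above. The `elu(x) + 1` feature map, the function names, and the tensor shapes are assumptions chosen for illustration. The sketch shows that summing over all tokens (non-causal) and dividing by a per-query normalizer gives the same result as the explicit N×N computation, without ever forming the N×N matrix.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map (an assumption, not from the paper).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v, eps=1e-6):
    """Non-causal, normalized linear attention in O(N * d * d_v) time.

    q, k: (N, d) arrays; v: (N, d_v) array.
    """
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v           # (d, d_v) summary over ALL tokens (non-causal)
    z = k.sum(axis=0)      # (d,) normalizer statistics
    num = q @ kv           # (N, d_v)
    den = q @ z + eps      # (N,) per-query normalization
    return num / den[:, None]


def quadratic_reference(q, k, v, eps=1e-6):
    """Same computation with the N x N weight matrix formed explicitly."""
    q, k = feature_map(q), feature_map(k)
    scores = q @ k.T                                   # (N, N)
    weights = scores / (scores.sum(axis=1, keepdims=True) + eps)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((128, 16)) for _ in range(3))
    print(np.allclose(linear_attention(q, k, v), quadratic_reference(q, k, v)))
```

The design point is the order of multiplication: computing `k.T @ v` first keeps the intermediate state at size `d × d_v`, independent of the number of tokens, whereas computing `q @ k.T` first creates the quadratic-size matrix.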
Peer Reviews
Decision: Submitted to ICLR 2025
- The motivation is clear and well-founded, with a thorough analysis of existing linear attention mechanisms to identify the key factors contributing to their effectiveness in diffusion.
- Extensive experiments across various applications support the claims that LinFusion is both efficient and generalizable to different diffusion models as well as existing training and testing pipelines.
- Overall, the writing is fluent and easy to follow, with informative figures that provide ample supporting i
- The comparisons are conducted only during the sampling stage. Since the proposed LinFusion module may also provide similar benefits during training, are there any metrics available for this stage?
- Related to the previous point, the paper includes only fine-tuning experiments. It would be valuable to investigate whether training a diffusion model from scratch with LinFusion replacing self-attention results in any performance drop. If so, what is the extent of this drop? Experiments on a class
Compared to the original SD-v1.5, LinFusion offers significant advantages in speed and GPU memory usage for generating high-resolution images. The extensive amount of open-sourcing and experiment reproducibility is greatly appreciated.
The comparison experiments in the paper are not comprehensive; for instance, the experimental section lacks an analysis of parameters and data size. It is unclear whether LinFusion can outperform the latest lightweight diffusion methods, such as BK-SDM[1], on the COCO 256×256 30K dataset. In Table 7, LinFusion shows a significant decrease in FID scores. In contrast, LinFusion exhibits better compatibility with other components and pipelines of SD, which would be better to analyze why this
- The paper presents an efficient text-to-image model, LinFusion, which innovatively addresses the computational inefficiencies inherent in high-resolution image generation with diffusion models.
- Two notable innovations of LinFusion are normalization-aware Mamba and non-causal Mamba, which significantly improve the model's performance.
- The authors have conducted an extensive set of experiments, demonstrating LinFusion's effectiveness across various resolutions and showcasing its superior c
- While LinFusion demonstrates significant improvements in computational efficiency, Mamba2 is designed for language models. Could you give more comparisons with state-of-the-art linear attention methods [1,2,3] in computer vision?
- The results of the experiment are unconvincing. Could you provide a more holistic assessment of LinFusion's performance across different aspects of image generation, using benchmarks such as HPSv2, T2I_Combench, and DPG?
- Could the linear attention be combined with the MM-DiT blocks, which are
Taxonomy
Topics: CCD and CMOS Imaging Sensors
Methods: Softmax · Attention Is All You Need · Diffusion
