LinFusion: 1 GPU, 1 Minute, 16K Image

Songhua Liu; Weihao Yu; Zhenxiong Tan; Xinchao Wang

arXiv:2409.02097·cs.CV·October 18, 2024



TL;DR

LinFusion replaces self-attention in diffusion models with a linear-complexity attention mechanism, enabling high-resolution generation (up to 16K images) on a single GPU after modest training, while matching or surpassing the quadratic-complexity Stable Diffusion baseline.
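
To see why quadratic complexity becomes prohibitive at this scale, the following back-of-the-envelope sketch counts spatial tokens and pairwise attention scores. It is an illustration under stated assumptions, not a figure from the paper: "16K" is taken to mean a 16384×16384 image, the latent is assumed to be downsampled 8× per side (as in the Stable Diffusion VAE), and self-attention is assumed to run at full latent resolution. The function name `attention_cost` is hypothetical.

```python
def attention_cost(image_side, vae_factor=8):
    """Return (tokens, pairwise_scores) for one self-attention layer.

    Assumptions (not from the paper): a square image, a latent that is
    `vae_factor` times smaller per side, and one token per latent pixel.
    """
    latent_side = image_side // vae_factor
    tokens = latent_side ** 2   # linear attention scales with this
    pairs = tokens ** 2         # softmax attention scales with this
    return tokens, pairs


tokens, pairs = attention_cost(16384)
print(tokens)   # 4194304 tokens (2048 x 2048 latent)
print(pairs)    # 17592186044416 query-key scores per head

# Size of one fully materialized fp16 score matrix, in TiB.
print(pairs * 2 / 2**40)   # 32.0
```

Memory-efficient attention kernels avoid materializing the full score matrix, but the number of query-key products still grows with the square of the token count, which is the cost a linear-complexity mechanism removes.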

Contribution

The paper proposes a generalized linear attention paradigm that draws on recent linear-complexity models (Mamba2, RWKV6, Gated Linear Attention) to reduce attention cost from quadratic to linear in the number of spatial tokens while maintaining or improving image generation quality.

Findings

01 · Achieves 16K image generation on a single GPU

02 · Matches or surpasses Stable Diffusion performance after modest training

03 · Highly compatible with existing Stable Diffusion (SD) components and pipelines

Abstract

Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we aim at a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention…
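
The two features named in the abstract, attention normalization and non-causal inference, can be illustrated with a generic linear attention sketch. This is not LinFusion's actual module: the `elu(x) + 1` feature map is an assumption borrowed from the wider linear attention literature, and the function names are hypothetical. The sketch accumulates one summary over all tokens (non-causal), divides each output by its summed attention weight (normalization), and includes a quadratic reference that computes the same result by forming every query-key score.

```python
import math


def phi(x):
    # elu(x) + 1: a positive feature map (assumed here, not taken from the paper).
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(q, k, v, eps=1e-6):
    """Normalized, non-causal linear attention.

    q, k: n rows of length d; v: n rows of length d_v.
    Cost is O(n * d * d_v); no n x n score matrix is formed.
    """
    d, d_v = len(k[0]), len(v[0])
    fq = [[phi(x) for x in row] for row in q]
    fk = [[phi(x) for x in row] for row in k]

    # Single pass over ALL tokens: every query later sees every key (non-causal).
    kv = [[0.0] * d_v for _ in range(d)]   # sum_j phi(k_j) v_j^T
    z = [0.0] * d                          # sum_j phi(k_j)
    for k_j, v_j in zip(fk, v):
        for a in range(d):
            z[a] += k_j[a]
            for b in range(d_v):
                kv[a][b] += k_j[a] * v_j[b]

    out = []
    for q_i in fq:
        # Normalizer: total attention weight assigned by this query.
        den = sum(q_i[a] * z[a] for a in range(d)) + eps
        out.append([sum(q_i[a] * kv[a][b] for a in range(d)) / den
                    for b in range(d_v)])
    return out


def quadratic_reference(q, k, v, eps=1e-6):
    """Same computation via explicit n x n scores, O(n^2) in tokens."""
    fq = [[phi(x) for x in row] for row in q]
    fk = [[phi(x) for x in row] for row in k]
    out = []
    for q_i in fq:
        w = [sum(a * b for a, b in zip(q_i, k_j)) for k_j in fk]
        den = sum(w) + eps
        out.append([sum(w_j * v_j[b] for w_j, v_j in zip(w, v)) / den
                    for b in range(len(v[0]))])
    return out
```

Because the key-value summary `kv` and the normalizer `z` have sizes independent of the token count, doubling the image side quadruples the work here, whereas the quadratic reference grows sixteenfold.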

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- The motivation is clear and well-founded, with a thorough analysis of existing linear attention mechanisms to identify the key factors contributing to their effectiveness in diffusion.
- Extensive experiments across various applications support the claims that LinFusion is both efficient and generalizable to different diffusion models as well as existing training and testing pipelines.
- Overall, the writing is fluent and easy to follow, with informative figures that provide ample supporting i

Weaknesses

- The comparisons are conducted only during the sampling stage. Since the proposed LinFusion module may also provide similar benefits during training, are there any metrics available for this stage?
- Related to the previous point, the paper includes only fine-tuning experiments. It would be valuable to investigate whether training a diffusion model from scratch with LinFusion replacing self-attention results in any performance drop. If so, what is the extent of this drop? Experiments on a class

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- Compared to the original SD-v1.5, LinFusion offers significant advantages in speed and GPU memory usage for generating high-resolution images.
- The extensive open-sourcing and experimental reproducibility are greatly appreciated.

Weaknesses

- The comparison experiments in the paper are not comprehensive; for instance, the experimental section lacks an analysis of parameters and data size.
- It is unclear whether LinFusion can outperform the latest lightweight diffusion methods, such as BK-SDM [1], on the COCO 256×256 30K dataset.
- In Table 7, LinFusion shows a significant decrease in FID scores. In contrast, LinFusion exhibits better compatibility with other components and pipelines of SD, which would be better to analyze why this

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The paper presents an efficient text-to-image model, LinFusion, which innovatively addresses the computational inefficiencies inherent in high-resolution image generation with diffusion models.
- Two notable innovations of LinFusion are normalization-aware Mamba and non-causal Mamba, which significantly improve the model's performance.
- The authors have conducted an extensive set of experiments, demonstrating LinFusion's effectiveness across various resolutions and showcasing its superior c

Weaknesses

- While LinFusion demonstrates significant improvements in computational efficiency, Mamba2 is designed for language models. Could you provide more comparisons with state-of-the-art linear attention methods [1,2,3] in computer vision?
- The experimental results are unconvincing. Could you provide a more holistic assessment of LinFusion's performance across different aspects of image generation, for example on HPSv2, T2I_Combench, and DPG?
- Could the linear attention be combined with the MM-DiT blocks, which are

Code & Models

Repositories

huage001/linfusion
PyTorch · Official



Taxonomy

Topics: CCD and CMOS Imaging Sensors

Methods: Softmax · Attention Is All You Need · Diffusion