Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

Anton Voronov; Denis Kuznedelev; Mikhail Khoroshikh; Valentin Khrulkov; Dmitry Baranchuk

arXiv:2412.01819·cs.CV·March 21, 2025

TL;DR

Switti is a scale-wise transformer for text-to-image synthesis. Its non-causal design makes sampling ~21% faster and more memory-efficient while slightly improving image quality, and disabling classifier-free guidance at high-resolution scales speeds up sampling by a further ~32% and improves fine-grained detail.

Contribution

This paper proposes a non-causal scale-wise transformer architecture for T2I generation, improving speed and quality over previous autoregressive models.
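The difference between the causal and non-causal variants can be pictured as two attention masks over the token sequence of all scales. The sketch below is illustrative only: it assumes the causal baseline uses a block-wise mask in which tokens at one scale attend to every token at the same or a coarser scale, and that the non-causal variant restricts attention to tokens of the current scale. The function names and the toy scale sizes are hypothetical and are not taken from the paper's code.

```python
def scale_block_ids(scale_sides):
    """For every token position, return the index of the scale it belongs to.

    scale_sides lists the side length of each square token map,
    e.g. [1, 2, 3] means maps of 1x1, 2x2 and 3x3 tokens.
    """
    ids = []
    for k, side in enumerate(scale_sides):
        ids.extend([k] * (side * side))
    return ids


def attention_mask(scale_sides, causal):
    """Build a boolean mask: mask[q][k] is True if query q may attend to key k.

    causal=True  -> assumed block-wise causal mask (current and coarser scales).
    causal=False -> assumed non-causal variant (current scale only).
    """
    ids = scale_block_ids(scale_sides)
    if causal:
        return [[q_scale >= k_scale for k_scale in ids] for q_scale in ids]
    return [[q_scale == k_scale for k_scale in ids] for q_scale in ids]


def attended_pairs(mask):
    """Count the query-key pairs the mask allows (a rough proxy for attention cost)."""
    return sum(sum(row) for row in mask)


# Toy comparison on three scales with 1, 4 and 9 tokens.
causal_cost = attended_pairs(attention_mask([1, 2, 3], causal=True))
non_causal_cost = attended_pairs(attention_mask([1, 2, 3], causal=False))
```

Under these assumptions the non-causal mask allows fewer query-key pairs and needs no cached keys and values from earlier scales, which is one plausible reading of where the speed and memory savings come from.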

Findings

01. The non-causal design samples ~21% faster than its causal counterpart and uses less memory.

02. Disabling classifier-free guidance at high-resolution scales improves fine-grained detail and speeds up sampling by a further ~32%.

03. Outperforms existing autoregressive text-to-image models and rivals state-of-the-art diffusion models.

Abstract

This work presents Switti, a scale-wise transformer for text-to-image (T2I) generation. We start by adapting an existing next-scale prediction autoregressive (AR) architecture to T2I generation, investigating and mitigating training stability issues in the process. Next, we argue that scale-wise transformers do not require causality and propose a non-causal counterpart facilitating ~21% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~32% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with…
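Disabling guidance at high-resolution scales can be sketched as a sampling loop that skips the unconditional forward pass once the scale index passes a cutoff. This is a minimal sketch under stated assumptions: it uses the standard classifier-free guidance combination `uncond + w * (cond - uncond)`, and the names `sample_scales`, `toy_model` and `last_guided_scale` are hypothetical, not the paper's API. The cutoff value is arbitrary.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of guidance_scale."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def sample_scales(model, num_scales, guidance_scale, last_guided_scale):
    """Run one prediction per scale, applying guidance only up to
    last_guided_scale (inclusive). Returns the per-scale predictions and
    the number of forward passes spent."""
    outputs = []
    passes = 0
    for k in range(num_scales):
        cond = model(k, conditional=True)
        passes += 1
        if k <= last_guided_scale:
            # Low-resolution scales: pay for the extra unconditional pass.
            uncond = model(k, conditional=False)
            passes += 1
            outputs.append(cfg_combine(cond, uncond, guidance_scale))
        else:
            # High-resolution scales: guidance disabled, conditional output only.
            outputs.append(cond)
    return outputs, passes


def toy_model(scale_index, conditional):
    """Stand-in for a transformer forward pass (hypothetical fixed outputs)."""
    return [2.0, 4.0] if conditional else [1.0, 2.0]


outputs, passes = sample_scales(toy_model, num_scales=4,
                                guidance_scale=3.0, last_guided_scale=1)
```

With four scales and guidance on the first two only, the loop spends six forward passes instead of eight. The saving matters most in practice because the skipped passes are the ones at the largest token maps.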

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion