Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk

TL;DR
Switti is a scale-wise transformer for text-to-image synthesis. It samples faster, uses less memory, and produces higher-quality images by dropping the causal constraint and disabling classifier-free guidance at high-resolution scales.
Contribution
This paper proposes a non-causal scale-wise transformer architecture for T2I generation, improving speed and quality over previous autoregressive models.
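To make the causal versus non-causal distinction concrete, here is a minimal sketch of two attention masks over a multi-scale token sequence. The summary does not spell out the exact mask, so the "same-scale only" variant is an assumption about what non-causal means here; the function names and the toy scale sizes are hypothetical.

```python
def scale_ids(scale_sides):
    """Label every token with the index of the scale it belongs to.
    A scale with side s contributes s * s tokens."""
    ids = []
    for k, side in enumerate(scale_sides):
        ids.extend([k] * (side * side))
    return ids


def block_causal_mask(scale_sides):
    """Next-scale AR style: a token attends to its own scale and all coarser ones."""
    ids = scale_ids(scale_sides)
    return [[q >= k for k in ids] for q in ids]


def same_scale_mask(scale_sides):
    """Assumed non-causal variant: a token attends only within its own scale,
    so no keys/values from earlier scales need to be cached."""
    ids = scale_ids(scale_sides)
    return [[q == k for k in ids] for q in ids]


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows; a rough proxy for attention cost."""
    return sum(sum(row) for row in mask)


# Toy schedule (hypothetical): a 1x1 scale followed by a 2x2 scale.
toy_sides = [1, 2]
causal_pairs = attended_pairs(block_causal_mask(toy_sides))
local_pairs = attended_pairs(same_scale_mask(toy_sides))
```

Counting allowed pairs is only a proxy: it shows why attending within a single scale is cheaper, but it does not reproduce the reported speed or memory figures.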
Findings
The non-causal model samples ~21% faster and uses less memory than its causal counterpart.
Disabling classifier-free guidance at high-resolution scales speeds up sampling by a further ~32% and improves fine-grained detail.
Outperforms existing autoregressive models and rivals state-of-the-art diffusion models.
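The guidance finding can be illustrated with a short sketch: classifier-free guidance needs a conditional and an unconditional forward pass, so skipping it at the largest scales removes work where the token count is highest. The combination rule below is the standard classifier-free guidance formula; the scale schedule, the cut-off index, and the function names are hypothetical and are not taken from the paper.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Standard classifier-free guidance: uncond + w * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def count_forward_tokens(scale_sides, last_guided_scale):
    """Tokens pushed through the model when guidance is used only at scales
    with index <= last_guided_scale (two passes there, one pass elsewhere)."""
    total = 0
    for i, side in enumerate(scale_sides):
        tokens = side * side
        passes = 2 if i <= last_guided_scale else 1
        total += passes * tokens
    return total


# Hypothetical scale schedule (side length of each token map), coarse to fine.
SCALE_SIDES = [1, 2, 3, 4, 6, 9, 13, 18, 24, 32]

# Guidance at every scale versus guidance switched off for the last two scales.
full = count_forward_tokens(SCALE_SIDES, last_guided_scale=len(SCALE_SIDES) - 1)
partial = count_forward_tokens(SCALE_SIDES, last_guided_scale=len(SCALE_SIDES) - 3)
```

Token counts are a rough stand-in for compute, not wall-clock time, so the ratio of `partial` to `full` under this made-up schedule should not be compared with the ~32% figure reported above.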
Abstract
This work presents Switti, a scale-wise transformer for text-to-image generation. We start by adapting an existing next-scale prediction autoregressive (AR) architecture to T2I generation, investigating and mitigating training stability issues in the process. Next, we argue that scale-wise transformers do not require causality and propose a non-causal counterpart facilitating ~21% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~32% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with…
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
