Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

Anton Razzhigaev; Arseniy Shakhmatov; Anastasia Maltseva; Vladimir Arkhipkin; Igor Pavlov; Ilya Ryabov; Angelina Kuts; Alexander Panchenko; Andrey Kuznetsov; Denis Dimitrov

arXiv:2310.03502 · cs.CV · October 6, 2023 · 1 citation

TL;DR

Kandinsky is a latent diffusion-based text-to-image model that combines an image prior with a modified MoVQ autoencoder. It reports an FID of 8.03 on COCO-30K and supports diverse generation modes.

Contribution

The paper introduces Kandinsky, which integrates an image prior model with latent diffusion and a modified MoVQ autoencoder, and ships with a user-friendly demo and an open-source release.

Findings

01

Achieved an FID score of 8.03 on the COCO-30K dataset.

02

Demonstrated high-quality image generation with diverse modes.

03

Outperformed existing open-source models in quality.
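
For readers unfamiliar with the FID number quoted in the findings above: FID is conventionally the Fréchet distance between two Gaussians fitted to Inception features of real and generated images, FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^(1/2)). This page does not describe how the reported figure was computed, so the following is only a minimal sketch of that standard formula. The function names `gaussian_stats` and `frechet_distance` are hypothetical, and the feature extraction step (an Inception network in the usual setup) is assumed to have happened elsewhere.

```python
import numpy as np


def gaussian_stats(features):
    """Mean and covariance of a (num_samples, dim) feature matrix.

    Assumption: `features` already holds image features (Inception
    activations in the usual FID setup); extracting them is out of scope.
    """
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma


def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    diff = mu_r - mu_g
    # For positive semi-definite covariances, the eigenvalues of
    # sigma_r @ sigma_g are real and non-negative, so the trace of the
    # matrix square root equals the sum of their square roots.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(
        diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt
    )
```

Two sanity checks follow from the formula: identical statistics give a distance of zero, and shifting only the mean by a vector d adds exactly ||d||^2. Lower values mean the generated feature distribution is closer to the real one.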

Abstract

Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture, combining the principles of the image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such…

Code & Models

Repositories

ai-forever/Kandinsky-2
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: Fast Attention Via Positive Orthogonal Random Features · Diffusion · Contrastive Language-Image Pre-training · Performer