Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Nikolai Vaulin, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov

TL;DR
Kandinsky 5.0 is a comprehensive family of high-resolution image and video generative models, featuring multiple sizes and optimized training techniques for superior quality and speed, with open-source availability.
Contribution
The paper introduces Kandinsky 5.0, a new family of foundation models for image and video generation, including novel training, architectural, and inference optimizations.
Findings
Achieves state-of-the-art performance in image and video synthesis
Demonstrates high generation speed and quality through novel optimizations
Provides open-source code and models for the research community
Abstract
This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite, a line-up of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, a line-up of fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations…
Code & Models
- 🤗 kandinskylab/Kandinsky-5.0-I2V-Pro-sft-5s-Diffusers · model · 273 dl · ♡ 27
- 🤗 kandinskylab/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers · model · 428 dl · ♡ 4
- 🤗 kandinskylab/Kandinsky-5.0-T2V-Lite-nocfg-5s-Diffusers · model · 14 dl · ♡ 1
- 🤗 kandinskylab/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers · model · 62 dl · ♡ 1
- 🤗 kandinskylab/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers · model · 1 dl
- 🤗 kandinskylab/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers · model · 221 dl · ♡ 2
- 🤗 kandinskylab/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers · model · 8 dl · ♡ 1
- 🤗 kandinskylab/Kandinsky-5.0-T2V-Lite-nocfg-10s-Diffusers · model · 8 dl
- 🤗 kandinskylab/Kandinsky-5.0-T2V-Lite-distilled16steps-10s-Diffusers · model · 11 dl
- 🤗 kandinskylab/Kandinsky-5.0-T2V-Pro-sft-5s-Diffusers · model · 241 dl · ♡ 7
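The repository names listed above appear to follow a regular pattern: task (T2V or I2V), model tier (Lite or Pro), a training or inference variant, and a clip length in seconds. The following is a minimal sketch for splitting a repository ID into those parts. The `Checkpoint` type, the `parse_repo_id` helper, and the field names are hypothetical and only illustrate the pattern seen in this list; the pattern itself is inferred from the names, not documented by the authors.

```python
import re
from typing import NamedTuple, Optional


class Checkpoint(NamedTuple):
    """Fields inferred from the repository names listed on this page (assumption)."""
    task: str      # "T2V" (text-to-video) or "I2V" (image-to-video)
    tier: str      # "Lite" or "Pro"
    variant: str   # label as written in the name, e.g. "sft", "pretrain", "nocfg"
    seconds: int   # clip length encoded in the name, e.g. 5 or 10


# Hypothetical pattern matching only the video checkpoints shown in the list.
_PATTERN = re.compile(
    r"^kandinskylab/Kandinsky-5\.0-(?P<task>[A-Z0-9]+)-(?P<tier>[A-Za-z]+)"
    r"-(?P<variant>[a-z0-9]+)-(?P<seconds>\d+)s-Diffusers$"
)


def parse_repo_id(repo_id: str) -> Optional[Checkpoint]:
    """Return the parsed fields, or None if the name does not fit the pattern."""
    m = _PATTERN.match(repo_id)
    if m is None:
        return None
    return Checkpoint(m["task"], m["tier"], m["variant"], int(m["seconds"]))


if __name__ == "__main__":
    print(parse_repo_id("kandinskylab/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers"))
```

A helper like this could be used to filter the list, for example to select only the 5-second Lite checkpoints. Names that do not fit the pattern return `None` rather than raising.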
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Human Motion and Animation
