MarkovGen: Structured Prediction for Efficient Text-to-Image Generation

Sadeep Jayasumana; Daniel Glasner; Srikumar Ramalingam; Andreas Veit; Ayan Chakrabarti; Sanjiv Kumar

arXiv:2308.10997·cs.CV·December 19, 2023

TL;DR

MarkovGen adds a lightweight Markov Random Field (MRF) on top of the Muse text-to-image model to enforce compatibility between image regions, reducing computational cost while improving image quality.

Contribution

This work pairs an MRF that models compatibility among image tokens with Muse, reducing the amount of Muse sampling needed and improving image quality compared with running the iterative model alone.

Findings

01. Speeds up Muse by 1.5 times

02. Reduces undesirable image artifacts

03. Improves image consistency and quality

Abstract

Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a light-weight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling…
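To make the idea of an MRF that encodes compatibility among tokens at different positions more concrete, here is a minimal sketch of mean-field inference on a pairwise MRF. It is a generic textbook procedure and not the authors' implementation: the function names (`softmax`, `mean_field`), the three-position chain, the two-label vocabulary, and the Potts-style `compat` table are all assumptions chosen for illustration. In a real system the unary scores would come from the base model's token logits.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_field(unary, neighbors, compat, num_iters=5):
    """Approximate per-position marginals of a pairwise MRF.

    unary[i][k]   : score for label k at position i (hypothetical base-model logits)
    neighbors[i]  : positions that interact with position i
    compat[k][l]  : compatibility between label k and a neighbor's label l
    """
    num_labels = len(compat)
    q = [softmax(u) for u in unary]  # start from the unary-only beliefs
    for _ in range(num_iters):
        new_q = []
        for i, u in enumerate(unary):
            scores = []
            for k in range(num_labels):
                # Expected compatibility with the current beliefs of all neighbors.
                msg = sum(q[j][l] * compat[k][l]
                          for j in neighbors[i]
                          for l in range(num_labels))
                scores.append(u[k] + msg)
            new_q.append(softmax(scores))
        q = new_q  # parallel update of all positions
    return q

# Toy chain of three positions with two labels (illustrative values only).
unary = [[2.0, 0.0], [0.0, 0.2], [2.0, 0.0]]
neighbors = [[1], [0, 2], [1]]
compat = [[1.0, -1.0], [-1.0, 1.0]]  # Potts-style: reward agreeing labels

marginals = mean_field(unary, neighbors, compat)
```

In this toy setup the middle position's unary score slightly favors label 1, but both of its neighbors strongly favor label 0, so the pairwise term pulls its marginal toward label 0. This is the sense in which a pairwise model can make neighboring predictions agree without additional passes through a large model.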

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Advanced Vision and Imaging

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings