Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Hyungyung Lee; Sungjin Park; Joonseok Lee; Edward Choi

arXiv:2204.07537 · cs.CV · October 17, 2022 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces MXQ-VAE, a novel multimodal cross-quantization VAE that enables unconditional generation of semantically consistent image-text pairs by learning a joint representation space.

Contribution

The paper proposes a new vector quantizer for joint image-text representations and demonstrates its effectiveness for unconditional multimodal pair generation.

Findings

01

A joint image-text representation space is effective for semantically consistent generation.

02

The method outperforms several baselines on synthetic and real-world datasets.

03

The approach enables unconditional generation of image-text pairs with semantic coherence.

Abstract

Although deep generative models have gained a lot of attention, most existing works are designed for unimodal generation. In this paper, we explore a new method for unconditional image-text pair generation. We design Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations, with which we discover that a joint image-text representation space is effective for semantically consistent image-text pair generation. To learn a multimodal semantic correlation in a quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy. Specifically, MXQ-VAE accepts a masked image-text pair as input and learns a quantized joint representation space, so that the input can be converted to a unified code sequence. We then perform unconditional image-text pair generation with the code sequence. Extensive experiments…
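The quantization step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows the generic VQ-VAE-style nearest-neighbour codebook lookup applied to a concatenated image-text feature sequence after random input masking. The Transformer encoder and all training losses are omitted, and every name, dimension, and ratio here (`mask_sequence`, `quantize`, `mask_ratio=0.3`, the zero mask vector, the random features) is a hypothetical choice made for illustration.

```python
import random


def mask_sequence(seq, mask_ratio, mask_vec, rng):
    """Replace a fixed fraction of positions with mask_vec (assumed masking scheme)."""
    n_mask = int(len(seq) * mask_ratio)
    masked_idx = set(rng.sample(range(len(seq)), n_mask))
    masked = [mask_vec if i in masked_idx else v for i, v in enumerate(seq)]
    return masked, masked_idx


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(seq, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in seq
    ]


rng = random.Random(0)
dim = 4

# Stand-ins for learned quantities: a codebook with 8 entries, 6 image patch
# features, and 3 text token features (all random, for illustration only).
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
image_feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
text_feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

# Joint sequence: image positions followed by text positions.
joint = image_feats + text_feats
mask_vec = [0.0] * dim
masked, masked_idx = mask_sequence(joint, 0.3, mask_vec, rng)

# Unified code sequence: one codebook index per position of the joint input.
codes = quantize(masked, codebook)
print(codes)
```

In this sketch the unified code sequence is simply the list of codebook indices over the joint sequence; a separate generative model over such index sequences would be needed for the unconditional generation stage the abstract mentions.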

Code & Models

Repositories

ttumyche/mxq-vae
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Residual Connection · Dropout · Position-Wise Feed-Forward Layer