L-Verse: Bidirectional Generation Between Image and Text

Taehoon Kim; Gwangmo Song; Sihaeng Lee; Sangyun Kim; Yewon Seo; Soonyoung Lee; Seung Hwan Kim; Honglak Lee; Kyunghoon Bae

arXiv:2111.11133 · cs.CV · April 7, 2022 · 1 citation

TL;DR

L-Verse pairs a feature-augmented variational autoencoder (AugVAE) with a bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. AugVAE reaches state-of-the-art reconstruction on ImageNet1K, and the model generates in both directions without fine-tuning.

Contribution

The paper proposes L-Verse, combining AugVAE and BiART for bidirectional generation, improving reconstruction and cross-modal generation without additional fine-tuning.

Findings

01 State-of-the-art reconstruction on ImageNet1K

02 Impressive image-to-text and text-to-image results on MS-COCO

03 Initial bidirectional vision-language learning on Conceptual Captions

Abstract

Far beyond learning long-range interactions of natural language, transformers are becoming the de-facto standard for many vision tasks with their power and scalability. Especially with cross-modal tasks between image and text, vector quantized variational autoencoders (VQ-VAEs) are widely used to make a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of feature-augmented variational autoencoder (AugVAE) and bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. Our AugVAE shows the state-of-the-art reconstruction performance on ImageNet1K validation set, along with the robustness to unseen images in the wild. Unlike other models, BiART can distinguish between image (or text) as a conditional reference and a generation target. L-Verse…
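The abstract notes that VQ-VAEs turn a raw RGB image into a sequence of feature vectors. As a rough illustration of the quantization step in a generic VQ-VAE (not the paper's AugVAE), the sketch below maps each encoder feature vector to the index of its nearest codebook entry, which gives the discrete token sequence a transformer can consume. The codebook, the feature values, and the function names `squared_distance` and `quantize` are all hypothetical and chosen only for illustration; real codebooks are learned and much larger.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


# Hypothetical 2-D codebook with three entries (an assumption for this sketch).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]

# Hypothetical encoder output for a 2x2 grid of image patches,
# flattened row by row into a sequence of four feature vectors.
features = [(0.1, -0.2), (0.9, 1.2), (-0.8, 0.7), (0.4, 0.4)]

tokens = quantize(features, codebook)
print(tokens)  # -> [0, 1, 2, 0]
```

Each index in `tokens` stands in for one image patch, so a cross-modal model could place this sequence alongside text tokens. How L-Verse actually builds and augments its codebook is not covered by this sketch.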

Code & Models

Repositories

tgisaturday/L-Verse (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Residual Connection · Byte Pair Encoding · Multi-Head Attention · Attention Is All You Need · Transformer · VQ-VAE