L-Verse: Bidirectional Generation Between Image and Text
Taehoon Kim, Gwangmo Song, Sihaeng Lee, Sangyun Kim, Yewon Seo, Soonyoung Lee, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae

TL;DR
L-Verse introduces a novel bidirectional transformer architecture with an augmented VAE for high-quality image-to-text and text-to-image generation, achieving state-of-the-art results without fine-tuning.
Contribution
The paper proposes L-Verse, combining AugVAE and BiART for bidirectional generation, improving reconstruction and cross-modal generation without additional fine-tuning.
Findings
State-of-the-art reconstruction on ImageNet1K
Impressive image-to-text and text-to-image results on MS-COCO
Initial bidirectional vision-language learning on Conceptual Captions
Abstract
Far beyond learning long-range interactions of natural language, transformers are becoming the de-facto standard for many vision tasks with their power and scalability. Especially with cross-modal tasks between image and text, vector quantized variational autoencoders (VQ-VAEs) are widely used to make a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of feature-augmented variational autoencoder (AugVAE) and bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. Our AugVAE shows the state-of-the-art reconstruction performance on ImageNet1K validation set, along with the robustness to unseen images in the wild. Unlike other models, BiART can distinguish between image (or text) as a conditional reference and a generation target. L-Verse…
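The abstract notes that VQ-VAEs turn a raw RGB image into a sequence of feature vectors that a transformer can consume. As a rough illustration of that quantization step only, the sketch below maps each feature vector to the index of its nearest codebook entry. It is not the paper's AugVAE: the function names, the toy two-dimensional codebook, and the feature values are all hypothetical, and a real model would learn the codebook and use an encoder network to produce the features.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Replace each feature vector with the index of its nearest codebook entry."""
    tokens = []
    for feature in features:
        distances = [squared_distance(feature, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each token (the decoder's input)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy data: 3 codebook entries, 4 encoder outputs, dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.4, 0.4]]

tokens = quantize(features, codebook)
print(tokens)  # [0, 1, 2, 0]
```

The resulting list of integer tokens plays the same role for an image that subword tokens play for text, which is what lets a single transformer model both modalities in one sequence.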
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Residual Connection · Byte Pair Encoding · Multi-Head Attention · Attention Is All You Need · Transformer · VQ-VAE
