A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation

Yupan Huang; Bei Liu; Jianlong Fu; Yutong Lu

arXiv:2110.09756 · cs.CV · October 20, 2021 · 1 citation

TL;DR

This paper introduces a unified multi-modal AI system built on a Transformer-based framework. Given several captions, it paints one rich image that reflects all of them; given an image, it describes it with multiple diverse captions. Non-autoregressive decoding supports real-time inference.

Contribution

The work proposes a unified Transformer-based framework that jointly models image and text representations, so that a single model handles both diverse caption generation and rich image creation from multiple captions.

Findings

01. Non-autoregressive decoding supports real-time inference

02. Produces multiple diverse captions for an uploaded image

03. Creates rich images that faithfully depict multiple input captions

Abstract

A creative image-and-text generative AI system mimics humans' extraordinary abilities to provide users with diverse and comprehensive caption suggestions, as well as rich image creations. In this work, we demonstrate such an AI creation system to produce both diverse captions and rich images. When users imagine an image and associate it with multiple captions, our system paints a rich image to reflect all captions faithfully. Likewise, when users upload an image, our system depicts it with multiple diverse captions. We propose a unified multi-modal framework to achieve this goal. Specifically, our framework jointly models image-and-text representations with a Transformer network, which supports rich image creation by accepting multiple captions as input. We consider the relations among input captions to encourage diversity in training and adopt a non-autoregressive decoding strategy to…
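
The abstract mentions non-autoregressive decoding without stating which scheme is used. As a rough illustration only, the sketch below shows one common non-autoregressive scheme, iterative mask-predict decoding, in plain Python. Everything in it is an assumption made for illustration: `toy_predict` is a hypothetical stand-in for a Transformer forward pass, and `MASK`, `mask_predict`, and the linear re-masking schedule are not taken from the paper or its repository. The point it shows is that the number of model calls is fixed by `iterations` rather than growing with the sequence length, which is what makes this family of decoders attractive for real-time use.

```python
import random

MASK = -1  # hypothetical id marking a position that still needs a token


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for one Transformer forward pass.

    Returns a (token, confidence) guess for every position at once,
    with no left-to-right dependency between positions.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict(length, vocab_size, iterations=4, seed=0):
    """Decode `length` tokens in a fixed number of parallel passes."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    rng = random.Random(seed)
    tokens = [MASK] * length   # start from a fully masked sequence
    conf = [0.0] * length      # confidence of the token kept at each position

    for t in range(iterations):
        guesses = toy_predict(tokens, vocab_size, rng)
        # Fill only the positions that are currently masked.
        for i, (tok, c) in enumerate(guesses):
            if tokens[i] == MASK:
                tokens[i], conf[i] = tok, c
        # Re-mask fewer positions on each pass; none on the last pass.
        n_mask = length * (iterations - 1 - t) // iterations
        if n_mask == 0:
            break
        # The least confident positions are masked again and re-predicted.
        for i in sorted(range(length), key=lambda j: conf[j])[:n_mask]:
            tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    # 8 tokens from a 100-token vocabulary in at most 4 model calls.
    print(mask_predict(length=8, vocab_size=100, iterations=4))
```

An autoregressive decoder would need one model call per generated token, whereas this loop calls the model at most `iterations` times regardless of `length`.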

Code & Models

Repositories

researchmm/generate-it (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Residual Connection · Adam · Label Smoothing