Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning

Longzhen Yang; Yihang Liu; Yitao Peng; Lianghua He

arXiv:2205.14458·cs.CV·September 22, 2022

TL;DR

This paper introduces the Variational Transformer, a framework that improves both accuracy and diversity in image captioning through new priors and a new reward baseline, rather than trading one for the other.

Contribution

The paper proposes a Variational Transformer model with three components, the Invisible Information Prior, the Auto-selectable GMM, and the Range-Median Reward baseline, to improve both accuracy and diversity in image captioning.

Findings

01. Achieves up to a 1.1% improvement in CIDEr score.

02. Enhances diversity with a 4.8% increase in self-CIDEr.

03. Performs close to human level in semantic retrieval tasks.

Abstract

Accuracy and diversity are two essential, measurable properties of natural and semantically correct captions. Many efforts have enhanced one of them at the expense of the other because of the trade-off between the two. In this work, we show that the inferior standard of accuracy drawn from human annotations (leave-one-out) is not appropriate for machine-generated captions. To improve diversity while keeping solid accuracy, we propose a novel Variational Transformer framework. By introducing the "Invisible Information Prior" and the "Auto-selectable GMM", we guide the encoder to learn precise language information and object relations in different scenes, which safeguards accuracy. By introducing the "Range-Median Reward" baseline, we retain more diverse candidates with higher rewards during RL-based training, which safeguards diversity. Experiments show that our…
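The abstract does not define the Range-Median Reward baseline, so the sketch below is not the paper's formulation. It is only a hypothetical illustration of why the choice of baseline matters in RL-based caption training: the baseline decides which sampled captions receive a positive advantage and are therefore reinforced. The reward values, the function name `advantages`, and the use of a plain median are all assumptions made for this example.

```python
from statistics import mean, median

def advantages(rewards, baseline_fn):
    """Subtract a scalar baseline from each sampled caption's reward."""
    baseline = baseline_fn(rewards)
    return [r - baseline for r in rewards]

# Hypothetical rewards (e.g. CIDEr-like scores) for five captions
# sampled for one image; one sample is an outlier.
rewards = [0.2, 0.4, 0.5, 0.6, 3.0]

adv_mean = advantages(rewards, mean)      # mean baseline is pulled up by the outlier
adv_median = advantages(rewards, median)  # median baseline is robust to the outlier

# Count how many candidates would be reinforced (positive advantage).
n_pos_mean = sum(a > 0 for a in adv_mean)
n_pos_median = sum(a > 0 for a in adv_median)
```

With these assumed numbers, the mean baseline leaves a single candidate with a positive advantage, while the median baseline leaves two. This matches the abstract's stated aim of retaining more candidates during training, but only in this toy setting.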

Code & Models

Repositories

kaelsunkiller/vat (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Position-Wise Feed-Forward Layer · Byte Pair Encoding