Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning
Longzhen Yang, Yihang Liu, Yitao Peng, Lianghua He

TL;DR
This paper introduces the Variational Transformer, a framework that improves both accuracy and diversity in image captioning through new priors and a new reward baseline, rather than trading one for the other.
Contribution
It proposes a Variational Transformer model with three components: the Invisible Information Prior and the Auto-selectable GMM, which target accuracy, and the Range-Median Reward baseline, which targets diversity.
Findings
Achieves up to 1.1% improvement in CIDEr score.
Enhances diversity with a 4.8% increase in self-CIDEr.
Performs close to human level in semantic retrieval tasks.
Abstract
Accuracy and diversity are two essential, measurable properties of natural and semantically correct captions. Many efforts have improved one at the expense of the other because of the trade-off between them. In this work, we show that the lower accuracy standard derived from human annotations (leave-one-out) is not appropriate for machine-generated captions. To improve diversity while maintaining solid accuracy, we propose a novel Variational Transformer framework. By introducing the "Invisible Information Prior" and the "Auto-selectable GMM", we guide the encoder to learn precise language information and object relations in different scenes, which supports accuracy. By introducing the "Range-Median Reward" baseline, we retain more diverse candidates with higher rewards during RL-based training, which supports diversity. Experiments show that our…
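The abstract does not define the Range-Median Reward, so the sketch below is not the authors' method. It only illustrates, under stated assumptions, the general idea of a reward baseline in RL-based caption training: each sampled caption's reward is compared against a baseline, and captions above the baseline receive a positive advantage. Here the baseline is assumed to be the plain median of the sampled rewards; the function name and the reward values are hypothetical.

```python
from statistics import median


def advantages_with_median_baseline(rewards):
    """Return reward minus the median reward for each sampled caption.

    Assumption: the baseline is the plain median of the sampled rewards.
    This is a generic illustration, not the paper's Range-Median Reward.
    """
    baseline = median(rewards)
    return [r - baseline for r in rewards]


# Hypothetical CIDEr-like rewards for five captions sampled for one image.
rewards = [0.9, 1.2, 0.4, 1.0, 0.7]
adv = advantages_with_median_baseline(rewards)

# Captions with a positive advantage would be reinforced in a
# REINFORCE-style update; those with a negative advantage discouraged.
num_reinforced = sum(1 for a in adv if a > 0)
```

With a median baseline, roughly half of the sampled candidates sit above the baseline. That is one way a baseline choice can keep several distinct captions in play, instead of rewarding only the single best one.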
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Position-Wise Feed-Forward Layer · Byte Pair Encoding
