Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space
Liwei Wang, Alexander G. Schwing, Svetlana Lazebnik

TL;DR
This paper introduces image captioning models based on conditional variational auto-encoders (CVAEs) with structured latent spaces. Using Gaussian mixture and additive Gaussian priors, they generate more diverse and more accurate descriptions than a CVAE with a fixed Gaussian prior.
Contribution
The paper proposes two CVAE models with structured latent spaces, one with a Gaussian mixture model (GMM) prior and one with an additive Gaussian (AG) prior, that produce more diverse and more accurate captions than a standard CVAE with a fixed Gaussian prior.
Findings
Both models produce captions that are more diverse and more accurate than a strong LSTM baseline and a "vanilla" CVAE with a fixed Gaussian prior.
Of the two, AG-CVAE shows particular promise.
Structured priors enable better modeling of multiple content types in images.
Abstract
This paper explores image caption generation using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a "vanilla" CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.
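The difference between the two priors described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the isotropic standard deviation `sigma`, the per-image `weights` (assumed non-negative and summing to 1), and the 2-D component means are all hypothetical choices made for illustration. Under these assumptions, a GMM prior samples around one component mean at a time, while an additive Gaussian prior samples around a single linear combination of the component means.

```python
import random

def combine_means(weights, means):
    """Weighted linear combination of component means (one value per latent dimension)."""
    dim = len(means[0])
    return [sum(w * mu[d] for w, mu in zip(weights, means)) for d in range(dim)]

def sample_gmm_prior(weights, means, sigma, rng):
    """GMM-style prior: pick ONE component with probability weights[k],
    then draw an isotropic Gaussian sample around that component's mean."""
    mu = rng.choices(means, weights=weights, k=1)[0]
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mu]

def sample_ag_prior(weights, means, sigma, rng):
    """Additive-Gaussian-style prior: draw an isotropic Gaussian sample around
    the weighted combination of ALL component means."""
    mu = combine_means(weights, means)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mu]

# Hypothetical setup: three content types, 2-D latent space.
component_means = [[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]]
# Assumed weights for an image containing the first two content types equally.
weights = [0.5, 0.5, 0.0]

rng = random.Random(0)
z_gmm = sample_gmm_prior(weights, component_means, sigma=0.1, rng=rng)
z_ag = sample_ag_prior(weights, component_means, sigma=0.1, rng=rng)
print("GMM prior sample:", z_gmm)  # lies near one of the first two component means
print("AG prior sample: ", z_ag)   # lies near their midpoint
```

In this toy setting, repeated GMM samples alternate between the two active components, whereas AG samples cluster around one image-specific mean that reflects both content types at once.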
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
