Diverse Video Captioning Through Latent Variable Expansion
Huanhou Xiao, Jinglun Shi

TL;DR
This paper introduces a framework for diverse video captioning that feeds the latent variables of an encoder-decoder captioner into a conditional GAN (CGAN), so that each video receives multiple, varied descriptions rather than a single accuracy-optimized one.
Contribution
It proposes a new approach that leverages latent variables and CGANs to produce diverse video captions, addressing the lack of diversity in previous methods.
Findings
Generates diverse video descriptions effectively.
Achieves superior results on benchmark datasets.
Introduces a new DCE metric for caption diversity.
Abstract
Automatically describing video content with text is a challenging but important task that has attracted a lot of attention in the computer vision community. Previous works mainly strive for the accuracy of the generated sentences while ignoring sentence diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework. Concretely, for a given video, the intermediate latent variables of the conventional encode-decode process are used as input to a conditional generative adversarial network (CGAN) with the purpose of generating diverse sentences. We adopt different Convolutional Neural Networks (CNNs) as our generator, which produces descriptions conditioned on latent variables, and our discriminator, which assesses the quality of generated sentences. Simultaneously, a novel DCE metric is…
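
The data flow described in the abstract can be illustrated with a toy sketch. It is not the authors' model: the CNN generator and discriminator are replaced by fixed random linear scorers, nothing is trained, and every name here (`VOCAB`, `make_generator`, `make_discriminator`, `sample_captions`) and every dimension is a hypothetical choice made for illustration. It only shows the interface: one latent vector per video, several noise draws, one candidate caption per draw, and a discriminator score for each (caption, latent) pair.

```python
import math
import random

# Hypothetical toy vocabulary; the paper's vocabulary is not given in this summary.
VOCAB = ["a", "man", "woman", "is", "playing", "singing", "guitar", "song", "on", "stage"]


def make_generator(latent_dim, noise_dim, caption_len, seed=0):
    """Return a toy generator: (latent, noise) -> list of words.

    Stand-in for the CNN generator; the weights are random and untrained.
    """
    rng = random.Random(seed)
    in_dim = latent_dim + noise_dim
    # One weight vector per (position, vocabulary word).
    weights = [
        [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in VOCAB]
        for _ in range(caption_len)
    ]

    def generate(latent, noise):
        x = list(latent) + list(noise)  # condition on the latent, vary with noise
        caption = []
        for position in weights:
            scores = [sum(w * xi for w, xi in zip(row, x)) for row in position]
            caption.append(VOCAB[scores.index(max(scores))])
        return caption

    return generate


def make_discriminator(latent_dim, seed=1):
    """Return a toy discriminator: (caption, latent) -> score in (0, 1)."""
    rng = random.Random(seed)
    word_vecs = {w: [rng.gauss(0, 1) for _ in range(latent_dim)] for w in VOCAB}

    def score(caption, latent):
        # Mean-pool word vectors, then take a dot product with the latent.
        pooled = [
            sum(word_vecs[w][i] for w in caption) / len(caption)
            for i in range(latent_dim)
        ]
        logit = sum(p * z for p, z in zip(pooled, latent))
        return 1.0 / (1.0 + math.exp(-logit))

    return score


def sample_captions(generate, latent, noise_dim, num_samples, seed=2):
    """Draw several noise vectors for one latent to get several captions."""
    rng = random.Random(seed)
    return [
        generate(latent, [rng.gauss(0, 1) for _ in range(noise_dim)])
        for _ in range(num_samples)
    ]


# Usage: one latent vector stands in for the encoder-decoder state of one video.
latent_rng = random.Random(3)
latent = [latent_rng.gauss(0, 1) for _ in range(8)]
generate = make_generator(latent_dim=8, noise_dim=4, caption_len=6)
discriminate = make_discriminator(latent_dim=8)
for caption in sample_captions(generate, latent, noise_dim=4, num_samples=5):
    print(" ".join(caption), round(discriminate(caption, latent), 3))
```

In an actual CGAN the two components would be trained adversarially; here the discriminator score is arbitrary and only demonstrates that it takes both the caption and the conditioning latent as input.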
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
