Diverse Video Captioning Through Latent Variable Expansion

Huanhou Xiao; Jinglun Shi

arXiv:1910.12019 · cs.CV · June 16, 2021 · 1 citation

Open Access

TL;DR

This paper introduces a framework for diverse video captioning that feeds the latent variables of an encoder-decoder captioner into a conditional generative adversarial network (CGAN), producing multiple varied descriptions per video rather than optimizing only for accuracy.

Contribution

It proposes a new approach that leverages latent variables and CGANs to produce diverse video captions, addressing the lack of diversity in previous methods.

Findings

01 Generates diverse video descriptions effectively.

02 Achieves superior results on benchmark datasets.

03 Introduces a new DCE metric for caption diversity.

Abstract

Automatically describing video content in text is a challenging but important task that has attracted considerable attention in the computer vision community. Previous works mainly strive for the accuracy of the generated sentences while ignoring sentence diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework for doing so. Concretely, for a given video, the intermediate latent variables of the conventional encode-decode process are used as input to a conditional generative adversarial network (CGAN) in order to generate diverse sentences. We adopt different Convolutional Neural Networks (CNNs) as our generator, which produces descriptions conditioned on the latent variables, and our discriminator, which assesses the quality of the generated sentences. Simultaneously, a novel DCE metric is…
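
The data flow described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the weights are random and untrained, plain linear projections stand in for the CNN generator and discriminator, and every name here (`ToyGenerator`, `ToyDiscriminator`, `caption_candidates`, `VOCAB`) is hypothetical. It also assumes, as is common for CGANs but not stated in the abstract, that a random noise vector is concatenated with the video latent so that one video can yield several captions.

```python
import math
import random

# Hypothetical toy vocabulary; the paper's vocabulary is not given here.
VOCAB = ["a", "man", "woman", "is", "playing", "guitar", "singing", "<eos>"]


def random_matrix(rows, cols, rng):
    """Return a rows x cols matrix of untrained Gaussian weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class ToyGenerator:
    """Maps (video latent, noise) to tokens; a linear stand-in for a CNN generator."""

    def __init__(self, latent_dim, noise_dim, max_len, rng):
        # One projection per output position (assumption made for brevity).
        self.position_weights = [
            random_matrix(len(VOCAB), latent_dim + noise_dim, rng)
            for _ in range(max_len)
        ]

    def generate(self, latent, noise):
        condition = list(latent) + list(noise)  # condition on latent, vary by noise
        tokens = []
        for weights in self.position_weights:
            logits = matvec(weights, condition)
            best = max(range(len(VOCAB)), key=logits.__getitem__)
            if VOCAB[best] == "<eos>":
                break
            tokens.append(VOCAB[best])
        return tokens


class ToyDiscriminator:
    """Scores a (caption, video latent) pair between 0 and 1."""

    def __init__(self, latent_dim, rng):
        self.embeddings = {
            tok: [rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for tok in VOCAB
        }

    def score(self, tokens, latent):
        if not tokens:
            return 0.5  # no evidence either way for an empty caption
        dim = len(latent)
        # Mean-pool token embeddings, then compare with the video latent.
        pooled = [
            sum(self.embeddings[t][i] for t in tokens) / len(tokens)
            for i in range(dim)
        ]
        return sigmoid(sum(p * l for p, l in zip(pooled, latent)))


def caption_candidates(latent, n, noise_dim=4, max_len=6, seed=0):
    """Generate n candidate captions for one video latent, each with a score."""
    rng = random.Random(seed)
    generator = ToyGenerator(len(latent), noise_dim, max_len, rng)
    discriminator = ToyDiscriminator(len(latent), rng)
    results = []
    for _ in range(n):
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        tokens = generator.generate(latent, noise)
        results.append((tokens, discriminator.score(tokens, latent)))
    return results


if __name__ == "__main__":
    video_latent = [0.1 * i for i in range(8)]  # placeholder for an encoder output
    for tokens, score in caption_candidates(video_latent, n=3):
        print(" ".join(tokens), round(score, 3))
```

Because nothing is trained, the printed captions are word salad; the sketch only shows how one fixed latent vector plus different noise samples leads to several candidate sentences, each scored against the same video condition.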

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition