TL;DR
This paper introduces a novel framework for video captioning that enhances object proposals, extracts high-level visual semantics, and validates generated sentences, leading to significant improvements over existing methods.
Contribution
It proposes a joint framework with a Conditional Graph, Latent Proposal Aggregation, and a Discriminative Language Validator for improved video captioning.
Findings
Significant improvements on the MSVD and MSR-VTT datasets.
Enhanced BLEU-4 and CIDEr scores.
Effective preservation of key semantic concepts.
Abstract
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video. Existing generative models such as encoder-decoder frameworks cannot explicitly explore object-level interactions and frame-level information from complex spatio-temporal data to generate semantically rich captions. Our main contribution is to identify three key problems in a joint framework for future video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional Graph that can fuse spatio-temporal information into latent object proposals. 2) Visual Knowledge: we propose Latent Proposal Aggregation to dynamically extract visual words with higher semantic levels. 3) Sentence Validation: we propose a novel Discriminative Language Validator to verify generated captions so that key semantic concepts can be effectively preserved. Our…
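The abstract does not say how Latent Proposal Aggregation is computed, so the following is only an illustrative sketch of the general idea of condensing many object proposals into a few higher-level "visual words". It assumes a standard scaled dot-product attention in which a small set of query vectors pools over proposal features. The names aggregate_proposals, proposals, and queries are hypothetical, the random vectors stand in for learned features, and none of this is taken from the paper's implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_proposals(proposals, queries):
    """Pool N proposal vectors into K "visual word" vectors.

    ASSUMPTION: plain scaled dot-product attention, used here only to
    illustrate aggregation; the paper's actual formulation may differ.
    proposals: list of N vectors of dimension D
    queries:   list of K vectors of dimension D (hypothetical latent queries)
    Returns (visual_words, attention_weights).
    """
    dim = len(proposals[0])
    scale = math.sqrt(dim)
    visual_words, attention = [], []
    for query in queries:
        # Similarity of this query to every proposal.
        scores = [
            sum(q * p for q, p in zip(query, proposal)) / scale
            for proposal in proposals
        ]
        weights = softmax(scores)
        # Weighted average of the proposals gives one visual word.
        word = [
            sum(w * proposal[d] for w, proposal in zip(weights, proposals))
            for d in range(dim)
        ]
        visual_words.append(word)
        attention.append(weights)
    return visual_words, attention


random.seed(0)
num_proposals, num_words, dim = 6, 2, 4  # toy sizes, chosen arbitrarily
proposals = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_proposals)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_words)]

visual_words, attention = aggregate_proposals(proposals, queries)
print(len(visual_words), len(visual_words[0]))
```

Each row of attention sums to 1, so every visual word is a convex combination of the proposal features. In a trained model the queries would be learned parameters rather than random vectors.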
