Syntax Customized Video Captioning by Imitating Exemplar Sentences
Yitian Yuan, Lin Ma, Wenwu Zhu

TL;DR
This paper introduces syntax customized video captioning, a new task: given a video and an exemplar sentence, generate a caption that accurately describes the video while imitating the exemplar's syntactic structure. Varying the exemplar yields more diverse video descriptions.
Contribution
The paper proposes a new model with a hierarchical syntax encoder and syntax-conditioned decoder, along with a training strategy leveraging traditional caption data and exemplar sentences.
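To make the encoder/decoder split concrete, here is a deliberately tiny, non-neural sketch of the task's input/output contract. It is not the paper's model: part-of-speech tags stand in for "syntax", the lexicon `TOY_POS` and the per-tag word lists in `VIDEO_WORDS` are hypothetical, hand-written stand-ins for a real syntax encoder and real video features, and the function names are invented for illustration.

```python
from typing import Dict, List

# Hypothetical hand-written lexicon; a real system would learn or parse syntax.
TOY_POS: Dict[str, str] = {
    "a": "DET", "the": "DET",
    "man": "NOUN", "dog": "NOUN", "ball": "NOUN", "guitar": "NOUN",
    "is": "AUX",
    "playing": "VERB", "running": "VERB", "chasing": "VERB",
}


def syntax_of(sentence: str) -> List[str]:
    """Toy 'syntax encoder': map an exemplar sentence to its POS-tag sequence."""
    return [TOY_POS[word] for word in sentence.lower().split()]


def customize_caption(video_words: Dict[str, List[str]], exemplar: str) -> str:
    """Toy 'syntax-conditioned decoder': fill each exemplar slot, in order,
    with the next video word that has the same tag (wrapping around)."""
    used: Dict[str, int] = {}
    caption: List[str] = []
    for tag in syntax_of(exemplar):
        candidates = video_words[tag]
        index = used.get(tag, 0)
        caption.append(candidates[index % len(candidates)])
        used[tag] = index + 1
    return " ".join(caption)


# Assumed stand-in for video semantics: words grouped by tag.
VIDEO_WORDS: Dict[str, List[str]] = {
    "DET": ["a"],
    "NOUN": ["man", "guitar"],
    "AUX": ["is"],
    "VERB": ["playing"],
}

print(customize_caption(VIDEO_WORDS, "the dog is running"))
print(customize_caption(VIDEO_WORDS, "the dog is chasing a ball"))
```

The same video content produces a differently structured caption for each exemplar, which is the behavior the task asks for; the paper's approach replaces both toy functions with learned components.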
Findings
Generated captions are syntactically diverse and semantically coherent.
The model outperforms baselines in diversity and fluency evaluations.
Extensive experiments validate the effectiveness of syntax imitation in video captioning.
Abstract
Enhancing the diversity of sentences to describe video contents is an important problem arising in recent video captioning research. In this paper, we explore this problem from a novel perspective of customizing video captions by imitating exemplar sentence syntaxes. Specifically, given a video and any syntax-valid exemplar sentence, we introduce a new task of Syntax Customized Video Captioning (SCVC) aiming to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence. To tackle the SCVC task, we propose a novel video captioning model, where a hierarchical sentence syntax encoder is first designed to extract the syntactic structure of the exemplar sentence, and then a syntax-conditioned caption decoder is devised to generate the syntactically structured caption expressing video semantics. As there is no…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
