Video Captioning Using Weak Annotation
Jingyi Hou, Yunde Jia, Xinxiao Wu, Yayun Qi

TL;DR
This paper introduces a novel method for video captioning that leverages weak annotations, such as semantic concepts, and employs progressive reasoning and dependency trees to generate fine sentences, achieving competitive results with less annotation effort.
Contribution
It proposes a progressive visual reasoning approach with dependency trees and iterative refinement to train video captioning models using weak annotations instead of strong, high-quality sentences.
Findings
The method, trained on weak annotations, achieves performance comparable to state-of-the-art models trained with strong annotations.
Dependency trees help model semantic relationships for sentence generation.
Iterative refinement improves caption quality over training iterations.
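The findings above mention dependency trees and iterative refinement without showing what either looks like. The following is a toy sketch only, not the authors' model: it assumes a hypothetical `Node` class and a hand-built tree, and it simply shows how a few weak concepts can be arranged in a dependency tree, read off as a word sequence, and extended step by step. The example words and the three refinement steps are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical dependency-tree node: a head word with ordered dependents."""
    word: str
    left: List["Node"] = field(default_factory=list)   # dependents placed before the head
    right: List["Node"] = field(default_factory=list)  # dependents placed after the head


def linearize(node: Node) -> List[str]:
    """Read the tree as a word sequence: left dependents, head, right dependents."""
    words: List[str] = []
    for child in node.left:
        words.extend(linearize(child))
    words.append(node.word)
    for child in node.right:
        words.extend(linearize(child))
    return words


# Step 0: start from weak annotation only (one action, one object).
root = Node("rides")
subject = Node("man")
root.left.append(subject)
step0 = " ".join(linearize(root))

# Step 1: attach an additional concept (here chosen by hand, standing in
# for a concept that a model would infer from the video).
obj = Node("horse")
root.right.append(obj)
step1 = " ".join(linearize(root))

# Step 2: attach function words to make the sentence fluent.
subject.left.append(Node("a"))
obj.left.append(Node("a"))
step2 = " ".join(linearize(root))

for step in (step0, step1, step2):
    print(step)
```

Each step reuses the same tree and only adds nodes, so the sentence grows from "man rides" to "a man rides a horse" without rewriting earlier structure. In the paper the additional concepts and relations are inferred rather than hand-written; that part is outside this sketch.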
Abstract
Video captioning has shown impressive progress in recent years. One key reason for the performance improvements made by existing methods lies in massive paired video-sentence data, but collecting such strong annotation, i.e., high-quality sentences, is time-consuming and laborious. In fact, there now exist an amazing number of videos with weak annotation that contains only semantic concepts such as actions and objects. In this paper, we investigate using weak annotation instead of strong annotation to train a video captioning model. To this end, we propose a progressive visual reasoning method that progressively generates fine sentences from weak annotations by inferring more semantic concepts and their dependency relationships for video captioning. To model concept relationships, we use dependency trees that are spanned by exploiting external knowledge from large sentence…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
