Aligning Source Visual and Target Language Domains for Unpaired Video Captioning
Fenglin Liu, Xian Wu, Chenyu You, Shen Ge, Yuexian Zou, Xu Sun

TL;DR
This paper introduces a novel unpaired video captioning approach that aligns visual and language domains, enabling effective captioning without paired data, and surpasses existing methods in performance.
Contribution
The paper proposes the UVC-VI system, which combines a Visual Injection Module and a Multimodal Collaborative Encoder to improve unpaired video captioning by aligning the visual and language domains end-to-end.
Findings
UVC-VI outperforms pipeline systems and some supervised models.
Equipping supervised systems with MCE improves CIDEr scores by 4-7%.
The approach achieves state-of-the-art results on MSVD and MSR-VTT datasets.
Abstract
Training a supervised video captioning model requires coupled video-caption pairs. However, for many target languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task, which aims to train models without coupled video-caption pairs in the target language. To solve the task, a natural choice is to employ a two-step pipeline system: first using a video-to-pivot captioning model to generate captions in a pivot language, and then using a pivot-to-target translation model to translate the pivot captions into the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, which leads to visually irrelevant target captions; 2) the errors in the generated pivot captions are propagated to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired…
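The two-step pipeline baseline described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the functions `toy_video_to_pivot`, `toy_pivot_to_target`, and `pipeline_caption`, the lookup table, and the frame labels are all hypothetical stand-ins for trained models, chosen only to show the two weaknesses the abstract names.

```python
from typing import Callable, Dict, List

# Hypothetical word table standing in for a trained pivot-to-target translator.
PIVOT_TO_TARGET: Dict[str, str] = {
    "a": "un",
    "dog": "chien",
    "cat": "chat",
    "is": "est",
    "running": "en train de courir",
}


def toy_video_to_pivot(frame_labels: List[str]) -> str:
    """Toy captioner: builds a pivot-language caption from frame labels.

    It deliberately mislabels a cat as a dog to simulate a captioning error.
    """
    subject = "dog" if "cat" in frame_labels else frame_labels[0]
    return f"a {subject} is running"


def toy_pivot_to_target(pivot_caption: str) -> str:
    """Toy translator: word-by-word lookup. It only ever sees text."""
    return " ".join(PIVOT_TO_TARGET.get(w, w) for w in pivot_caption.split())


def pipeline_caption(
    frame_labels: List[str],
    captioner: Callable[[List[str]], str],
    translator: Callable[[str], str],
) -> str:
    """Two-step pipeline: video -> pivot caption -> target caption."""
    pivot = captioner(frame_labels)
    # The translator receives only the pivot text, so (1) it has no access
    # to the visual input and (2) any error in `pivot` is carried forward.
    return translator(pivot)


if __name__ == "__main__":
    print(pipeline_caption(["cat", "grass"], toy_video_to_pivot, toy_pivot_to_target))
```

In this toy run the captioner's mistake ("dog" for a cat) survives translation unchanged, because the second stage has no visual signal with which to correct it. That gap is the one an end-to-end approach aims to close.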
Taxonomy
Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Human Pose and Action Recognition
