Aligning Source Visual and Target Language Domains for Unpaired Video Captioning

Fenglin Liu; Xian Wu; Chenyu You; Shen Ge; Yuexian Zou; Xu Sun

arXiv:2211.12148 · cs.CV · November 23, 2022 · 1 citation

TL;DR

This paper introduces a novel unpaired video captioning approach that aligns visual and language domains, enabling effective captioning without paired data, and surpasses existing methods in performance.

Contribution

The paper proposes the UVC-VI system, which combines a Visual Injection Module with a Multimodal Collaborative Encoder (MCE) to improve unpaired video captioning by aligning the visual and language domains end-to-end.

Findings

01 UVC-VI outperforms pipeline systems and some supervised models.

02 Equipping supervised systems with MCE improves CIDEr scores by 4-7%.

03 The approach achieves state-of-the-art results on MSVD and MSR-VTT datasets.

Abstract

Training a supervised video captioning model requires coupled video-caption pairs. However, for many target languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task, which aims to train models without coupled video-caption pairs in the target language. To solve the task, a natural choice is a two-step pipeline system: first use a video-to-pivot captioning model to generate captions in a pivot language, then use a pivot-to-target translation model to translate the pivot captions into the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, producing visually irrelevant target captions; 2) errors in the generated pivot captions are propagated to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired…
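The two-step pipeline baseline described in the abstract can be sketched as follows. This is only an illustration of the data flow, not the authors' code: the class `PivotPipeline`, the toy captioner and translator, the lookup table, and the choice of English as pivot and German as target are all hypothetical stand-ins for real trained models.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical placeholder: one frame's visual features as a list of floats.
Frame = Sequence[float]


@dataclass
class PivotPipeline:
    """Two-step baseline: video -> pivot caption -> target caption."""

    video_to_pivot: Callable[[Sequence[Frame]], str]
    pivot_to_target: Callable[[str], str]

    def caption(self, frames: Sequence[Frame]) -> str:
        pivot_caption = self.video_to_pivot(frames)
        # Only text crosses this boundary: the translator never sees `frames`
        # (limitation 1), and any mistake already present in `pivot_caption`
        # is handed on unchanged (limitation 2).
        return self.pivot_to_target(pivot_caption)


# Toy stand-ins for trained models (assumptions for illustration only).
def toy_captioner(frames: Sequence[Frame]) -> str:
    # Pretend that too few frames yields a truncated, erroneous caption.
    return "a man is playing a guitar" if len(frames) > 1 else "a man is"


TOY_DICTIONARY = {"a man is playing a guitar": "ein Mann spielt Gitarre"}


def toy_translator(pivot_caption: str) -> str:
    # Text-only lookup; an unseen (erroneous) pivot caption cannot be repaired.
    return TOY_DICTIONARY.get(pivot_caption, "<unk>")


pipeline = PivotPipeline(toy_captioner, toy_translator)
good = pipeline.caption([[0.1, 0.2], [0.3, 0.4]])
bad = pipeline.caption([[0.1, 0.2]])
```

In this toy setup a faulty pivot caption leaves the translator with nothing to fall back on, because the frames are out of its reach. The summary above indicates that UVC-VI instead aligns the visual and language domains end-to-end, so the target-language side is not limited to the pivot text alone.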

Taxonomy

Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Human Pose and Action Recognition