VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

Kashu Yamazaki; Sang Truong; Khoa Vo; Michael Kidd; Chase Rainwater; Khoa Luu; Ngan Le

arXiv:2206.12972 · cs.CV · August 9, 2022 · 1 citation

TL;DR

VLCap introduces a contrastive learning approach utilizing vision-language features to generate coherent paragraph descriptions for untrimmed videos, improving accuracy and diversity over existing methods.

Contribution

The paper presents a novel vision-language feature framework combined with contrastive learning for paragraph video captioning, enhancing coherence and diversity.

Findings

01. Outperforms state-of-the-art methods on ActivityNet Captions and YouCookII datasets.

02. Achieves higher accuracy and diversity in video paragraph captioning.

03. Demonstrates effectiveness of contrastive learning in multimodal video understanding.

Abstract

In this paper, we leverage the human perceiving process, which involves vision and language interaction, to generate coherent paragraph descriptions of untrimmed videos. We propose vision-language (VL) features consisting of two modalities: (i) a vision modality that captures the global visual content of the entire scene, and (ii) a language modality that extracts descriptions of scene elements, covering both human and non-human objects (e.g., animals, vehicles) and both visual and non-visual elements (e.g., relations, activities). Furthermore, we train VLCap under a contrastive learning VL loss. Experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that VLCap outperforms existing state-of-the-art (SOTA) methods on both accuracy and diversity metrics.
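
The abstract names a contrastive VL loss but does not give its form. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style loss over paired vision and language feature vectors. It is an assumption about what such a loss could look like, not the authors' implementation: the function names, the cosine-similarity scoring, and the temperature value of 0.07 are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_similarity_matrix(vision_feats, language_feats):
    """Entry [i][j] is the cosine similarity of vision i and language j."""
    v = [l2_normalize(row) for row in vision_feats]
    t = [l2_normalize(row) for row in language_feats]
    return [[sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(vision_feats, language_feats, temperature=0.07):
    """Generic InfoNCE-style loss (assumed form, not taken from the paper).

    Pair i of vision_feats and language_feats is treated as the positive;
    every other pairing in the batch is treated as a negative.
    """
    sim = cosine_similarity_matrix(vision_feats, language_feats)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # language-to-vision view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    vision = [[1.0, 0.0], [0.0, 1.0]]
    language_matched = [[1.0, 0.0], [0.0, 1.0]]
    language_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs: ", symmetric_contrastive_loss(vision, language_matched))
    print("shuffled pairs:", symmetric_contrastive_loss(vision, language_shuffled))
```

In this toy setup the loss is small when each vision vector lines up with its own language vector and large when the pairs are shuffled. A contrastive objective is expected to behave this way, whatever its exact form in VLCap.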

Code & Models

Repositories

UARK-AICV/VLCAP
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Contrastive Learning