VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning
Kashu Yamazaki, Sang Truong, Khoa Vo, Michael Kidd, Chase Rainwater, Khoa Luu, Ngan Le

TL;DR
VLCap introduces a contrastive learning approach utilizing vision-language features to generate coherent paragraph descriptions for untrimmed videos, improving accuracy and diversity over existing methods.
Contribution
The paper presents a novel vision-language feature framework combined with contrastive learning for paragraph video captioning, enhancing coherence and diversity.
Findings
Outperforms state-of-the-art methods on ActivityNet Captions and YouCookII datasets.
Achieves higher accuracy and diversity in video paragraph captioning.
Demonstrates effectiveness of contrastive learning in multimodal video understanding.
Abstract
In this paper, we leverage the human perceiving process, which involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos. We propose vision-language (VL) features consisting of two modalities, i.e., (i) a vision modality to capture the global visual content of the entire scene and (ii) a language modality to extract descriptions of scene elements, covering both human and non-human objects (e.g., animals, vehicles) and visual and non-visual elements (e.g., relations, activities). Furthermore, we propose to train VLCap under a contrastive learning VL loss. The experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that VLCap outperforms existing SOTA methods on both accuracy and diversity metrics.
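The abstract does not spell out the form of the contrastive learning VL loss. As a rough illustration only, the sketch below uses a common symmetric InfoNCE-style objective that pulls matched vision and language embeddings together and pushes mismatched pairs apart within a batch. The function names, the temperature value, and the symmetric formulation are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_vl_loss(vision_emb, language_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not the paper's exact loss).

    vision_emb, language_emb: arrays of shape (batch, dim), where row i of
    each array comes from the same video segment (the positive pair).
    """
    v = l2_normalize(vision_emb)
    t = l2_normalize(language_emb)
    logits = v @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(v))                  # positives sit on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # vision -> language
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # language -> vision
    return 0.5 * (loss_v2t + loss_t2v)


# Usage: aligned pairs should score a lower loss than misaligned ones.
rng = np.random.default_rng(0)
vision = rng.normal(size=(4, 8))
aligned = contrastive_vl_loss(vision, vision)
misaligned = contrastive_vl_loss(vision, np.roll(vision, 1, axis=0))
print(aligned < misaligned)
```

In a sketch like this, the temperature controls how sharply the loss penalizes near-miss negatives; 0.07 is a widely used default in contrastive vision-language work, not a value reported for VLCap.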
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Contrastive Learning
