GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning

Eileen Wang; Caren Han; Josiah Poon

arXiv:2410.09377·cs.CV·October 15, 2024

Open Access

TL;DR

This paper presents GEM-VPC, a novel multimodal graph-enhanced framework for video paragraph captioning that effectively captures key events and themes, improving performance on benchmark datasets.

Contribution

The paper introduces a dual graph structure and a node selection module to make better use of multimodal signals and external knowledge in video paragraph captioning.

Findings

01

Achieves superior performance on benchmark datasets.

02

Effectively models key events and themes in videos.

03

Enhances decoding efficiency with node selection.

Abstract

Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise key events within a video. Despite recent advancements, challenges persist, notably in effectively utilising the multimodal signals inherent in videos and in addressing the long-tail distribution of words. We introduce a novel multimodal integrated caption generation framework for VPC that leverages information from various modalities and external knowledge bases. Our framework constructs two graphs: a 'video-specific' temporal graph capturing major events and interactions between multimodal information and commonsense knowledge, and a 'theme graph' representing correlations between words of a specific theme. These graphs serve as input for a transformer network with a shared encoder-decoder architecture. We also introduce a node selection module to enhance decoding efficiency by selecting the most…
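
As a rough illustration of what a 'theme graph' over words could look like, the sketch below links two words when they appear together in several captions of the same theme. The abstract does not say how correlations are computed, so the co-occurrence counting, the `build_theme_graph` helper, the `min_count` threshold, and the stopword list are all assumptions made for this example, not the authors' method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stopword list, only for this illustration.
STOPWORDS = frozenset({"a", "an", "the"})

def build_theme_graph(captions, min_count=2, stopwords=STOPWORDS):
    """Return {(word_a, word_b): count} for word pairs that co-occur
    in at least `min_count` captions of one theme.

    Assumption: "correlation" is approximated by caption-level
    co-occurrence counts; the paper may use a different measure.
    """
    pair_counts = Counter()
    for caption in captions:
        # Unique, sorted content words so each pair is counted once per caption
        # and always appears in the same (alphabetical) order.
        words = sorted({w for w in caption.lower().split() if w not in stopwords})
        pair_counts.update(combinations(words, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Toy captions standing in for one theme (e.g. cooking).
captions = [
    "a man slices an onion",
    "a woman slices a tomato",
    "a man fries an onion",
]
graph = build_theme_graph(captions)
```

Here the nodes are words and the dictionary keys are undirected edges weighted by how many captions contain both words; raising `min_count` prunes weak edges.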

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition