Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality

Sishuo Chen, Lei Li, Shuhuai Ren, Rundong Gao, Yuanxin Liu, Xiaohan Bi, Xu Sun, Lu Hou

arXiv:2403.19221 · cs.CV · March 29, 2024 · 2 cites

TL;DR

This paper introduces MR-VPC, a robust multimodal video paragraph captioning framework that effectively handles missing modalities using data augmentation and knowledge distillation, improving performance in real-world scenarios.

Contribution

The paper proposes a novel framework and architecture for resilient multimodal video captioning, incorporating DropAM and DistillAM to handle incomplete data effectively.

Findings

1. MR-VPC outperforms existing models on YouCook2 and ActivityNet Captions datasets.

2. The framework maintains high performance even with missing modalities.

3. DropAM and DistillAM enhance learning in modality-deficient environments.

Abstract

Video paragraph captioning (VPC) involves generating detailed narratives for long videos, utilizing supportive modalities such as speech and event boundaries. However, the existing models are constrained by the assumption of constant availability of a single auxiliary modality, which is impractical given the diversity and unpredictable nature of real-world scenarios. To this end, we propose a Missing-Resistant framework MR-VPC that effectively harnesses all available auxiliary inputs and maintains resilience even in the absence of certain modalities. Under this framework, we propose the Multimodal VPC (MVPC) architecture integrating video, speech, and event boundary inputs in a unified manner to process various auxiliary inputs. Moreover, to fortify the model against incomplete data, we introduce DropAM, a data augmentation strategy that randomly omits auxiliary inputs, paired with…
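
The abstract describes DropAM as a data augmentation strategy that randomly omits auxiliary inputs. A minimal sketch of that idea follows, under stated assumptions: the function name `drop_auxiliary`, the field names `speech` and `event_boundaries`, the independent per-modality drop, the `drop_prob` value, and the use of `None` as the "missing" marker are all illustrative choices, not details taken from the paper or its repository.

```python
import random
from typing import Optional

# Hypothetical field names for the auxiliary modalities; the paper's actual
# data format may differ.
AUXILIARY_KEYS = ("speech", "event_boundaries")


def drop_auxiliary(sample: dict, drop_prob: float = 0.5,
                   rng: Optional[random.Random] = None) -> dict:
    """Return a shallow copy of `sample` in which each auxiliary input is
    independently replaced by None with probability `drop_prob`.

    The video input is never dropped, and the input dict is not mutated.
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    rng = rng or random.Random()
    augmented = dict(sample)
    for key in AUXILIARY_KEYS:
        # Assumption: modalities are dropped independently of each other.
        if key in augmented and rng.random() < drop_prob:
            augmented[key] = None
    return augmented


if __name__ == "__main__":
    example = {
        "video": [0.1, 0.2, 0.3],           # stand-in for video features
        "speech": "add the onions",         # stand-in for a speech transcript
        "event_boundaries": [(0.0, 4.5)],   # stand-in for (start, end) times
    }
    print(drop_auxiliary(example, drop_prob=0.5, rng=random.Random(0)))
```

Applied per training sample, an augmentation of this kind would expose a model to every combination of present and absent auxiliary inputs, which is the situation the abstract says existing models are not prepared for.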

Code & Models

Repositories

lancopku/mr-vpc (Official)

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition