Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality
Sishuo Chen, Lei Li, Shuhuai Ren, Rundong Gao, Yuanxin Liu, Xiaohan Bi, Xu Sun, Lu Hou

TL;DR
This paper introduces MR-VPC, a robust multimodal video paragraph captioning framework that effectively handles missing modalities using data augmentation and knowledge distillation, improving performance in real-world scenarios.
Contribution
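The paper proposes the MR-VPC framework and the Multimodal VPC (MVPC) architecture for video paragraph captioning that stays robust when auxiliary inputs are missing. It pairs DropAM, a data augmentation strategy that randomly omits auxiliary inputs during training, with DistillAM, a knowledge distillation component.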
Findings
MR-VPC outperforms existing models on the YouCook2 and ActivityNet Captions datasets.
The framework maintains high performance even with missing modalities.
DropAM and DistillAM enhance learning in modality-deficient environments.
Abstract
Video paragraph captioning (VPC) involves generating detailed narratives for long videos, utilizing supportive modalities such as speech and event boundaries. However, the existing models are constrained by the assumption of constant availability of a single auxiliary modality, which is impractical given the diversity and unpredictable nature of real-world scenarios. To this end, we propose a Missing-Resistant framework MR-VPC that effectively harnesses all available auxiliary inputs and maintains resilience even in the absence of certain modalities. Under this framework, we propose the Multimodal VPC (MVPC) architecture integrating video, speech, and event boundary inputs in a unified manner to process various auxiliary inputs. Moreover, to fortify the model against incomplete data, we introduce DropAM, a data augmentation strategy that randomly omits auxiliary inputs, paired with…
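The DropAM idea described in the abstract, randomly omitting auxiliary inputs during training, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `drop_auxiliary`, the field names `video`, `speech`, and `boundaries`, the per-modality drop probability `p_drop`, and the use of `None` to mark a missing input are all assumptions made for illustration. The summary above does not state the paper's actual drop rates or how missing inputs are encoded.

```python
import random
from typing import Any, Dict, Optional

# Hypothetical names for the auxiliary modalities; the video itself is never dropped.
AUXILIARY_KEYS = ("speech", "boundaries")


def drop_auxiliary(
    sample: Dict[str, Any],
    p_drop: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Dict[str, Any]:
    """Return a copy of `sample` with each auxiliary input independently
    replaced by None with probability `p_drop` (an assumed hyperparameter)."""
    if rng is None:
        rng = random.Random()
    augmented = dict(sample)  # shallow copy so the original sample is untouched
    for key in AUXILIARY_KEYS:
        if key in augmented and rng.random() < p_drop:
            augmented[key] = None  # assumption: None stands for "modality missing"
    return augmented


# Example usage with toy data.
example = {
    "video": [0.1, 0.2, 0.3],          # placeholder for video features
    "speech": "add the onions",        # placeholder for a speech transcript
    "boundaries": [(0.0, 4.5)],        # placeholder for event boundaries
}
augmented = drop_auxiliary(example, p_drop=0.5, rng=random.Random(0))
```

Applying such an augmentation at each training step would expose a model to every combination of present and absent auxiliary inputs, which is the intuition behind training for robustness to missing modalities.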
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition
