Grafting Pre-trained Models for Multimodal Headline Generation
Lingfeng Qiao, Chen Wu, Ye Liu, Haoyuan Peng, Di Yin, Bo Ren

TL;DR
This paper introduces a novel method for multimodal headline generation by grafting a pre-trained video encoder onto a language model and using a consensus fusion mechanism, achieving strong results on real-world data.
Contribution
It proposes a new approach to combine pre-trained video and language models for multimodal headline generation, addressing modality balance challenges.
Findings
The grafted model outperforms existing methods on a new real-world dataset.
The consensus fusion mechanism effectively integrates multimodal information.
The approach demonstrates strong practical applicability in real-world scenarios.
Abstract
Multimodal headline generation uses both video frames and transcripts to generate a natural-language title for a video. Because annotating grounded headlines for video is labor-intensive and impractical, large-scale, manually annotated data is lacking. Previous research on pre-trained language models and video-language models has achieved significant progress in related downstream tasks. However, none of these models can be directly applied to a multimodal headline architecture, which requires both a multimodal encoder and a sentence decoder. A major challenge in simply gluing a language model to a video-language model is modality balance, which aims to combine complementary visual and language abilities. In this paper, we propose a novel approach that grafts the video encoder from a pre-trained video-language model onto a generative pre-trained language model. We also present a consensus…
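The idea of grafting a video encoder onto a language model can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how the two models are joined, so the sketch assumes one common pattern, in which video features are linearly projected to the language model's hidden size and concatenated with the transcript token embeddings. All names (`linear_project`, `graft_inputs`) and sizes are hypothetical, random numbers stand in for real encoder outputs, and the consensus fusion mechanism is not modeled.

```python
import random


def linear_project(features, weight):
    """Map each vector of length d_in to length d_out.

    weight is a d_out x d_in matrix stored as a list of rows.
    """
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight]
            for vec in features]


def graft_inputs(video_features, text_embeddings, weight):
    """Project video features to the text hidden size and prepend them
    to the text sequence (an assumed joining scheme, not the paper's)."""
    projected = linear_project(video_features, weight)
    return projected + text_embeddings


random.seed(0)
D_VIDEO, D_TEXT = 4, 3      # hypothetical feature sizes
N_FRAMES, N_TOKENS = 2, 5   # hypothetical sequence lengths

# Stand-ins for the outputs of a video encoder and a text embedding layer.
video_features = [[random.random() for _ in range(D_VIDEO)]
                  for _ in range(N_FRAMES)]
text_embeddings = [[random.random() for _ in range(D_TEXT)]
                   for _ in range(N_TOKENS)]
# Stand-in for a learned projection matrix.
weight = [[random.random() for _ in range(D_VIDEO)]
          for _ in range(D_TEXT)]

# Joint sequence a downstream decoder could attend over.
joint = graft_inputs(video_features, text_embeddings, weight)
print(len(joint), len(joint[0]))
```

In a real system the projection would be trained and the joint sequence would feed the language model's decoder. How the two modalities are weighted against each other is the modality-balance question the abstract raises, and this sketch leaves it open.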
Taxonomy
Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Video Analysis and Summarization
Methods: None
