Grafting Pre-trained Models for Multimodal Headline Generation

Lingfeng Qiao; Chen Wu; Ye Liu; Haoyuan Peng; Di Yin; Bo Ren

arXiv:2211.07210 · cs.CV · November 15, 2022

TL;DR

This paper introduces a novel method for multimodal headline generation by grafting a pre-trained video encoder onto a language model and using a consensus fusion mechanism, achieving strong results on real-world data.

Contribution

It proposes a new approach to combine pre-trained video and language models for multimodal headline generation, addressing modality balance challenges.

Findings

01 Grafted model outperforms existing methods on a new real-world dataset.

02 The consensus fusion mechanism effectively integrates multimodal information.

03 The approach demonstrates strong practical applicability in real-world scenarios.

Abstract

Multimodal headline generation uses both video frames and transcripts to generate a natural-language title for a video. Large-scale, manually annotated data is lacking, and annotating grounded headlines for video is labor intensive and impractical. Previous research on pre-trained language models and video-language models has achieved significant progress on related downstream tasks. However, none of these models can be directly applied to a multimodal headline architecture, which needs both a multimodal encoder and a sentence decoder. A major challenge in simply gluing a language model to a video-language model is modality balance, which aims to combine the complementary abilities of the visual and language modalities. In this paper, we propose a novel approach that grafts the video encoder from a pre-trained video-language model onto a generative pre-trained language model. We also present a consensus…

Taxonomy

Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Video Analysis and Summarization

Methods: None