Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling

Haogeng Liu; Qihang Fan; Tingkai Liu; Linjie Yang; Yunzhe Tao; Huaibo Huang; Ran He; Hongxia Yang

arXiv:2310.04991·cs.CV·October 12, 2023·1 cites

TL;DR

Video-Teller introduces a novel video-language foundation model that fuses multi-modal information and employs fine-grained alignment to improve video-to-text generation, achieving higher accuracy with minimal additional computational cost.

Contribution

The paper presents a new model that combines multi-modal fusion and fine-grained alignment, leveraging frozen pretrained modules for efficient and accurate video description generation.

Findings

01. Achieves 4% CIDEr score improvement on MSR-VTT

02. Utilizes cascaded Q-Former for cross-modal fusion

03. Enhances video summarization with minimal extra parameters

Abstract

This paper proposes Video-Teller, a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment to significantly enhance the video-to-text generation task. Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules. It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions. To effectively integrate visual and auditory information, Video-Teller builds upon the image-based BLIP-2 model and introduces a cascaded Q-Former which fuses information across frames and ASR texts. To better guide video summarization, we introduce a fine-grained modality alignment objective, where the cascaded Q-Former's output embedding is trained to align with the caption/summary embedding created by a pretrained text auto-encoder.…
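The fine-grained alignment objective described above can be illustrated with a small sketch. The abstract does not state which distance is used, so the cosine-distance loss below is an assumption made for illustration only, as are the function names `cosine_similarity` and `alignment_loss` and the toy vectors. It shows the general idea of pulling each output embedding of the fusion module toward the matching embedding from a text auto-encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(fused_tokens, target_tokens):
    """Mean cosine distance between paired embeddings.

    fused_tokens:  stand-in for the fusion module's output embeddings.
    target_tokens: stand-in for the text auto-encoder's caption/summary embeddings.
    ASSUMPTION: cosine distance is a placeholder; the source does not name the metric.
    """
    if not fused_tokens or len(fused_tokens) != len(target_tokens):
        raise ValueError("expected two non-empty sequences of equal length")
    distances = [
        1.0 - cosine_similarity(f, t) for f, t in zip(fused_tokens, target_tokens)
    ]
    return sum(distances) / len(distances)


# Toy data: the first pair points the same way, the second pair is orthogonal.
fused = [[1.0, 0.0], [1.0, 0.0]]
target = [[2.0, 0.0], [0.0, 1.0]]
loss = alignment_loss(fused, target)
```

In a setup like the one the abstract describes, a loss of this kind would be computed only with respect to the trainable fusion module, since the vision and language modules are frozen and the text auto-encoder is pretrained.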

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: ALIGN