Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling
Haogeng Liu, Qihang Fan, Tingkai Liu, Linjie Yang, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang

TL;DR
Video-Teller introduces a novel video-language foundation model that fuses multi-modal information and employs fine-grained alignment to improve video-to-text generation, achieving higher accuracy with minimal additional computational cost.
Contribution
The paper presents a new model that combines multi-modal fusion and fine-grained alignment, leveraging frozen pretrained modules for efficient and accurate video description generation.
Findings
Achieves a 4% CIDEr improvement on MSR-VTT
Utilizes cascaded Q-Former for cross-modal fusion
Enhances video summarization with minimal extra parameters
Abstract
This paper proposes Video-Teller, a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment to significantly enhance the video-to-text generation task. Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules. It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions. To effectively integrate visual and auditory information, Video-Teller builds upon the image-based BLIP-2 model and introduces a cascaded Q-Former which fuses information across frames and ASR texts. To better guide video summarization, we introduce a fine-grained modality alignment objective, where the cascaded Q-Former's output embedding is trained to align with the caption/summary embedding created by a pretrained text auto-encoder.…
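The alignment objective described in the abstract pulls the cascaded Q-Former's output embedding toward the caption/summary embedding from a pretrained text auto-encoder. The abstract does not state which distance is used, so the following is only a minimal sketch that assumes a cosine-distance loss over pooled, fixed-size embeddings. The function names `cosine_alignment_loss` and `batch_alignment_loss` are hypothetical and are not taken from the paper.

```python
import math


def cosine_alignment_loss(video_emb, text_emb):
    """Return 1 - cosine similarity between two equal-length vectors.

    ASSUMPTION: cosine distance stands in for the paper's alignment
    objective, whose exact form is not given in the abstract.
    """
    if len(video_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(v * t for v, t in zip(video_emb, text_emb))
    norm_v = math.sqrt(sum(v * v for v in video_emb))
    norm_t = math.sqrt(sum(t * t for t in text_emb))
    if norm_v == 0.0 or norm_t == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_v * norm_t)


def batch_alignment_loss(video_embs, text_embs):
    """Average the per-pair loss over a batch of (video, text) pairs."""
    if len(video_embs) != len(text_embs) or not video_embs:
        raise ValueError("need a non-empty batch of matched pairs")
    losses = [cosine_alignment_loss(v, t) for v, t in zip(video_embs, text_embs)]
    return sum(losses) / len(losses)
```

In this sketch the loss is 0 when the two embeddings point in the same direction and grows as they diverge. In a setup like the one the abstract describes, the text embedding would come from the pretrained text auto-encoder and serve as the target for the video-side embedding.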
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: ALIGN
