TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

Wei Li; Hehe Fan; Yongkang Wong; Mohan Kankanhalli; Yi Yang

arXiv:2405.13911 · cs.CV · November 5, 2024

TL;DR

TOPA introduces a novel method to extend large language models for video understanding by generating textual video data and aligning it with real videos using CLIP, eliminating the need for video pre-training.

Contribution

The paper proposes Text-Only Pre-Alignment (TOPA), a new approach that enables large language models to understand videos without training on real video data.

Findings

01

TOPA-Llama2-13B achieves 51.0% Top-1 accuracy on EgoSchema.

02

TOPA surpasses previous video-text pre-training methods.

03

The framework is effective without any real video data pre-training.

Abstract

Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding, without the need for pre-training on real video data. Specifically, we first employ an advanced LLM to automatically generate Textual Videos comprising continuous textual frames, along with corresponding annotations to simulate real video-text data. Then, these annotated textual videos are used to pre-align a language-only LLM with the video modality. To bridge the gap between…
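
To make the idea of a "Textual Video" concrete, the following is a minimal sketch of how one such sample might be represented: an ordered list of textual frames plus annotations. The class name `TextualVideo`, its fields, and the helper `to_plain_text` are hypothetical and chosen only for illustration. This excerpt does not describe how TOPA actually stores its generated data or feeds it to the LLM, so this is not the authors' pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class TextualVideo:
    """Hypothetical container for one LLM-generated textual video."""

    # Ordered textual frames: one short description per simulated frame.
    frames: list[str]
    # Hypothetical annotations attached to the whole textual video.
    caption: str = ""
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)


def to_plain_text(video: TextualVideo) -> str:
    """Flatten a textual video into a readable string, for inspection only."""
    lines = [f"Frame {i}: {text}" for i, text in enumerate(video.frames, start=1)]
    if video.caption:
        lines.append(f"Caption: {video.caption}")
    for question, answer in video.qa_pairs:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented example content, not taken from the paper's dataset.
    sample = TextualVideo(
        frames=[
            "A person picks up a kettle.",
            "The person pours water into a cup.",
        ],
        caption="Someone makes a hot drink.",
        qa_pairs=[("What is poured into the cup?", "Water.")],
    )
    print(to_plain_text(sample))
```

Keeping the frames as an ordered list preserves the temporal structure that the abstract calls "continuous textual frames", while the annotations stay separate so they can serve as supervision targets.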

Code & Models

Repositories

dhg-wei/topa (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Adam · Dropout · Softmax · ALIGN