TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment
Wei Li, Hehe Fan, Yongkang Wong, Mohan Kankanhalli, Yi Yang

TL;DR
TOPA extends large language models to video understanding by training on LLM-generated textual videos and using CLIP to align them with real videos, which removes the need for pre-training on real video data.
Contribution
The paper proposes Text-Only Pre-Alignment (TOPA), a new approach that enables large language models to understand videos without training on real video data.
Findings
TOPA-Llama2-13B achieves 51.0% Top-1 accuracy on EgoSchema.
On EgoSchema, TOPA surpasses previous video-text pre-training methods.
The framework is effective without any pre-training on real video data.
Abstract
Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding, without the need for pre-training on real video data. Specifically, we first employ an advanced LLM to automatically generate Textual Videos comprising continuous textual frames, along with corresponding annotations to simulate real video-text data. Then, these annotated textual videos are used to pre-align a language-only LLM with the video modality. To bridge the gap between…
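The data flow described in the abstract can be illustrated with a minimal sketch. It assumes a textual video is an ordered list of frame descriptions with annotations, and that each textual frame is mapped to one feature vector. All names here (`TextualVideo`, `toy_text_encoder`, `encode_textual_video`) are hypothetical, and the hash-based encoder is only a deterministic placeholder for a real text encoder such as CLIP's; it is not the paper's implementation.

```python
import hashlib
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TextualVideo:
    """Hypothetical container: ordered textual frames plus annotations."""
    frames: List[str]
    caption: str = ""
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)


def toy_text_encoder(text: str, dim: int = 8) -> List[float]:
    """Placeholder for a real text encoder (assumption, not CLIP's API).

    Maps a string to a deterministic unit-norm vector derived from its
    SHA-256 digest, so the sketch runs without any model weights.
    """
    if not 1 <= dim <= 32:
        raise ValueError("dim must be between 1 and 32 (SHA-256 has 32 bytes)")
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def encode_textual_video(video: TextualVideo, dim: int = 8) -> List[List[float]]:
    """Encode each textual frame into one feature vector, keeping frame order."""
    return [toy_text_encoder(frame, dim) for frame in video.frames]


# Illustrative example data (invented for this sketch).
video = TextualVideo(
    frames=[
        "A person picks up a kettle.",
        "The person fills the kettle with water.",
        "The person places the kettle on the stove.",
    ],
    caption="A person prepares to boil water.",
    qa_pairs=[("What is the person holding?", "A kettle.")],
)
features = encode_textual_video(video)
```

In this sketch, `features` plays the role of the per-frame inputs that a language model would be aligned against, with `caption` and `qa_pairs` serving as the text supervision. Per the TL;DR, real videos would later be brought into a compatible feature space using CLIP.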
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Adam · Dropout · Softmax · ALIGN
