TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, Marie-Francine Moens

TL;DR
TS-LLaVA is a training-free video LLM that builds its visual tokens with a Thumbnail-and-Sampling strategy on top of an image LLM pre-trained on image-text data, and it reports state-of-the-art results among training-free video LLMs.
Contribution
The paper examines the limitations of existing visual-token compression strategies for training-free video LLMs and proposes TS-LLaVA, which constructs visual tokens from a video through a Thumbnail-and-Sampling strategy.
Findings
TS-LLaVA outperforms existing training-free video LLMs on the reported benchmarks.
The 34B model surpasses GPT-4V on MVBench.
The 34B model achieves results comparable to the 72B training-based Video-LLaMA2.
Abstract
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal contents. For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data. In contrast, paired image-text data are much easier to obtain, and there is substantial similarity between images and videos. Consequently, extending image LLMs for video understanding tasks presents an appealing alternative. Developing effective strategies for compressing visual tokens from multiple frames is a promising way to leverage the powerful pre-trained image LLM. In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM. The findings lead to our method TS-LLaVA, which constructs visual tokens through a Thumbnail-and-Sampling strategy. Given a video,…
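The abstract names two ingredients, a thumbnail and sampling, but is cut off before describing them. The sketch below is only one plausible reading of that idea, not the authors' implementation: a few evenly spaced frames are tiled into a single composite image, and a fixed budget of tokens is drawn evenly from the tokens of all frames. The function names (`equidistant_indices`, `make_thumbnail`, `sample_tokens`), the toy list-based frame format, the row-major grid layout, and the uniform sampling rule are all assumptions made for illustration.

```python
from typing import List, Sequence

# Toy grayscale frame: H rows of W pixel values (an assumption for this sketch).
Frame = List[List[int]]


def equidistant_indices(num_items: int, num_picks: int) -> List[int]:
    """Return up to num_picks indices spread evenly over range(num_items)."""
    if num_items <= 0 or num_picks <= 0:
        return []
    if num_picks >= num_items:
        return list(range(num_items))
    if num_picks == 1:
        return [num_items // 2]
    step = (num_items - 1) / (num_picks - 1)
    return [round(i * step) for i in range(num_picks)]


def make_thumbnail(frames: Sequence[Frame], grid_cols: int) -> Frame:
    """Tile equally sized frames, row-major, into one composite image.

    Assumes len(frames) is a multiple of grid_cols and all frames share a shape.
    """
    thumbnail: Frame = []
    for start in range(0, len(frames), grid_cols):
        row_frames = frames[start:start + grid_cols]
        height = len(row_frames[0])
        for y in range(height):
            line: List[int] = []
            for frame in row_frames:
                line.extend(frame[y])  # place the frames side by side
            thumbnail.append(line)
    return thumbnail


def sample_tokens(per_frame_tokens: Sequence[Sequence[str]], budget: int) -> List[str]:
    """Keep an evenly spaced subset of the tokens from all frames.

    The tokens are stand-ins for whatever a visual encoder would produce.
    """
    flat = [tok for frame_tokens in per_frame_tokens for tok in frame_tokens]
    return [flat[i] for i in equidistant_indices(len(flat), budget)]
```

In a pipeline of this shape, the composite image would be encoded like any single image, and its tokens would be passed to the image LLM together with the sampled tokens. How the two token sets are ordered or combined is not stated in the text above and is left open here.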
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Human Pose and Action Recognition
