TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

Tingyu Qu; Mingxiao Li; Tinne Tuytelaars; Marie-Francine Moens

arXiv:2411.11066·cs.CV·November 19, 2024

TL;DR

TS-LLaVA is a training-free approach to video understanding. It constructs visual tokens from video frames with a Thumbnail-and-Sampling strategy, so that a pre-trained image LLM can be reused for video tasks, and it outperforms existing training-free video LLMs on benchmarks.

Contribution

The paper examines the limitations of existing visual-token compression strategies for training-free video LLMs and proposes TS-LLaVA, which constructs visual tokens from a video through a Thumbnail-and-Sampling strategy and requires no video-specific training.

Findings

01. Outperforms existing training-free video LLMs on benchmarks.
02. 34B model surpasses GPT-4V on MVBench.
03. Achieves comparable results to 72B training-based Video-LLaMA2.

Abstract

Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal contents. For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data. In contrast, paired image-text data are much easier to obtain, and there is substantial similarity between images and videos. Consequently, extending image LLMs for video understanding tasks presents an appealing alternative. Developing effective strategies for compressing visual tokens from multiple frames is a promising way to leverage the powerful pre-trained image LLM. In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM. The findings lead to our method TS-LLaVA, which constructs visual tokens through a Thumbnail-and-Sampling strategy. Given a video,…
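
The abstract above is cut off before it describes the method, so the following is only an illustrative sketch of what a thumbnail-plus-sampling token construction could look like, not the authors' implementation. Every name here (`equidistant_indices`, `tile_thumbnail`, `build_visual_input`), the 2x2 grid, the evenly spaced token selection, and the token budget are assumptions made for illustration. Frames are plain 2-D lists and tokens are arbitrary Python objects; a real pipeline would resize frames and run a vision encoder.

```python
from typing import Any, List, Sequence, Tuple


def equidistant_indices(num_items: int, k: int) -> List[int]:
    """Return up to k indices spread evenly over range(num_items)."""
    if num_items <= 0 or k <= 0:
        return []
    k = min(k, num_items)
    if k == 1:
        return [num_items // 2]
    # Integer arithmetic keeps the first and last index fixed at the ends.
    return [(i * (num_items - 1)) // (k - 1) for i in range(k)]


def tile_thumbnail(frames: Sequence[List[List[Any]]], rows: int, cols: int) -> List[List[Any]]:
    """Tile rows*cols equally sized frames (2-D lists) into one grid image."""
    if len(frames) != rows * cols:
        raise ValueError("need exactly rows * cols frames")
    image: List[List[Any]] = []
    for r in range(rows):
        row_frames = frames[r * cols:(r + 1) * cols]
        for y in range(len(row_frames[0])):
            line: List[Any] = []
            for frame in row_frames:
                line.extend(frame[y])  # place frames side by side
            image.append(line)
    return image


def build_visual_input(
    frames: Sequence[List[List[Any]]],
    frame_tokens: Sequence[Sequence[Any]],
    grid: int = 2,          # hypothetical thumbnail layout: grid x grid frames
    token_budget: int = 8,  # hypothetical number of sampled tokens to keep
) -> Tuple[List[List[Any]], List[Any]]:
    """Return (thumbnail image, sampled tokens) for one video.

    Assumption: the thumbnail is built from a few evenly spaced frames,
    while the sampled tokens are drawn evenly from the tokens of all frames.
    """
    thumb_idx = equidistant_indices(len(frames), grid * grid)
    thumbnail = tile_thumbnail([frames[i] for i in thumb_idx], grid, grid)
    all_tokens = [tok for toks in frame_tokens for tok in toks]
    keep = equidistant_indices(len(all_tokens), token_budget)
    return thumbnail, [all_tokens[i] for i in keep]
```

In this sketch the thumbnail gives the model a compact overview of a few frames, while the sampled tokens keep some coverage of every part of the video under a fixed budget. How the two parts are actually built and combined in TS-LLaVA is described in the paper and the official repository.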

Code & Models

Repositories

tingyu215/ts-llava (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Human Pose and Action Recognition