Tarsier: Recipes for Training and Evaluating Large Video Description Models

Jiawei Wang; Liping Yuan; Yuchen Zhang; Haomiao Sun

arXiv:2407.00634 · cs.CV · September 25, 2024 · 2 cites

TL;DR

Tarsier is a family of large-scale video-language models trained with a two-stage procedure. The models substantially outperform existing open-source models at generating detailed video descriptions and achieve state-of-the-art results on nine public benchmarks.

Contribution

The paper presents Tarsier, a large-scale video description model trained with a two-stage process, and introduces the DREAM-1K benchmark for evaluating video description quality.

Findings

01

Tarsier shows a +51.4% advantage over the strongest existing open-source model in human side-by-side evaluation.

02

Tarsier achieves state-of-the-art results on nine public benchmarks.

03

Tarsier2 further improves performance with a 4.8% advantage over GPT-4o.
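
The percentages above come from side-by-side comparisons, but this page does not define how the "advantage" is computed. One common way to summarise such judgments is win rate minus loss rate. The sketch below assumes that definition; the function name and the counts are hypothetical and are not taken from the paper.

```python
def side_by_side_advantage(wins: int, ties: int, losses: int) -> float:
    """Return (wins - losses) / total as a percentage.

    Assumed definition: the paper's exact formula is not given on this page.
    """
    total = wins + ties + losses
    if total == 0:
        raise ValueError("no judgments to summarise")
    return 100.0 * (wins - losses) / total


# Hypothetical counts, for illustration only.
adv = side_by_side_advantage(wins=60, ties=25, losses=15)
print(f"{adv:+.1f}%")  # +45.0%
```

Under this definition a negative value means the compared model was preferred more often, which is how a figure like the disadvantage against Gemini 1.5 Pro would read.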

Abstract

Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a $+51.4\%$ advantage in human side-by-side evaluation over the strongest model. Additionally, they are comparable to state-of-the-art proprietary models, with a $+12.3\%$ advantage against GPT-4V and a $-6.7\%$ disadvantage against Gemini 1.5 Pro. When upgraded to Tarsier2 by building upon SigLIP and Qwen2-7B, it further…
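
The architecture described in the abstract (each frame encoded separately, with temporal modelling left to the LLM) can be illustrated with a toy sketch. Everything here is a stand-in: `encode_frame`, `build_llm_input`, `TOKENS_PER_FRAME`, and `DIM` are hypothetical names, the "encoder" is a trivial function rather than CLIP-ViT, and details such as any projection between encoder and LLM are not described on this page.

```python
from typing import List

TOKENS_PER_FRAME = 4  # hypothetical; real encoders emit many more tokens per frame
DIM = 8               # hypothetical embedding width

Token = List[float]


def encode_frame(frame: List[float]) -> List[Token]:
    """Stand-in for an image encoder: it sees ONE frame and nothing else."""
    mean = sum(frame) / len(frame)
    return [[mean + 0.01 * t] * DIM for t in range(TOKENS_PER_FRAME)]


def build_llm_input(frames: List[List[float]], prompt: List[Token]) -> List[Token]:
    """Encode frames independently, then concatenate tokens in temporal order.

    The language model (not shown) would attend over this whole sequence,
    which is where relationships between frames get modelled.
    """
    visual_tokens: List[Token] = []
    for frame in frames:
        visual_tokens.extend(encode_frame(frame))
    return visual_tokens + prompt


frames = [[0.1 * i] * 16 for i in range(3)]   # three fake "frames"
prompt = [[0.0] * DIM for _ in range(2)]      # two fake prompt embeddings
sequence = build_llm_input(frames, prompt)
print(len(sequence))  # 3 frames * 4 tokens + 2 prompt tokens = 14
```

The point of the sketch is the division of labour: the per-frame encoder never sees neighbouring frames, so ordering information reaches the LLM only through the position of each frame's tokens in the sequence.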

Code & Models

Repositories

bytedance/tarsier
PyTorch · Official

Models

Datasets

omni-research/DREAM-1K
dataset · 198 dl

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques