Tarsier: Recipes for Training and Evaluating Large Video Description Models
Jiawei Wang, Liping Yuan, Yuchen Zhang, Haomiao Sun

TL;DR
Tarsier introduces a large-scale video-language model with a novel training approach that significantly outperforms existing open-source models in generating detailed video descriptions and achieves state-of-the-art results across multiple benchmarks.
Contribution
The paper presents Tarsier, a new large-scale video description model with a two-stage training process and introduces the DREAM-1K benchmark for evaluating video description quality.
Findings
Tarsier shows a 51.4% advantage over the strongest existing open-source model in human side-by-side evaluation.
Tarsier achieves state-of-the-art results on nine public benchmarks.
Tarsier2 further improves performance with a 4.8% advantage over GPT-4o.
Abstract
Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a 51.4% advantage in human side-by-side evaluation over the strongest model. Additionally, they are comparable to state-of-the-art proprietary models, with an advantage against GPT-4V and a disadvantage against Gemini 1.5 Pro. When upgraded to Tarsier2 by building upon SigLIP and Qwen2-7B, it further…
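The architecture named in the abstract (frames encoded separately, temporal relationships left to the LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: `encode_frame`, `build_visual_sequence`, `describe`, and the toy embedding are hypothetical stand-ins chosen only to show the data flow, and the real model would use CLIP-ViT features and a trained LLM.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # stand-in for one frame's pixel data
Token = List[float]  # stand-in for one visual token embedding


def encode_frame(frame: Frame, tokens_per_frame: int = 2) -> List[Token]:
    """Hypothetical stand-in for a per-frame image encoder such as CLIP-ViT.

    It receives exactly one frame, so it cannot capture anything temporal.
    The toy "embedding" is just the frame mean plus a token index.
    """
    mean = sum(frame) / len(frame)
    return [[mean, float(i)] for i in range(tokens_per_frame)]


def build_visual_sequence(
    frames: Sequence[Frame], tokens_per_frame: int = 2
) -> List[Token]:
    """Encode every frame independently, then concatenate in temporal order."""
    sequence: List[Token] = []
    for frame in frames:
        sequence.extend(encode_frame(frame, tokens_per_frame))
    return sequence


def describe(
    frames: Sequence[Frame],
    prompt_tokens: List[Token],
    llm: Callable[[List[Token]], str],
) -> str:
    """Pass visual tokens followed by prompt tokens to an LLM-like callable.

    Temporal reasoning happens only here, over the ordered token sequence.
    `llm` is an assumed interface, not a real library call.
    """
    return llm(build_visual_sequence(frames) + prompt_tokens)
```

The point of the sketch is the division of labor: the encoder is applied per frame with no cross-frame interaction, so the order of the concatenated tokens is the only temporal signal the language model receives.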
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
