Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

Liping Yuan; Jiawei Wang; Haomiao Sun; Yuchen Zhang; Yuan Lin

arXiv:2501.07888·cs.CV·January 27, 2025

TL;DR

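Tarsier2 is a large vision-language model that improves detailed video description and general video understanding by scaling pre-training data from 11M to 40M video-text pairs, applying fine-grained temporal alignment during supervised fine-tuning, and using model-based sampling to build preference data for DPO training. It outperforms existing models across multiple benchmarks.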

Contribution

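The paper introduces Tarsier2, an LVLM trained on a larger pre-training corpus (40M video-text pairs, up from 11M), with fine-grained temporal alignment during supervised fine-tuning and DPO training on automatically constructed preference data. The authors report state-of-the-art performance on comprehensive video understanding tasks.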

Findings

01. Tarsier2-7B outperforms GPT-4o and Gemini-1.5-Pro in detailed video description; on DREAM-1K it improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro.

02. Achieves new SOTA across 15 public benchmarks.

03. Demonstrates versatility in various video understanding tasks.

Abstract

We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations,…
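
As a rough illustration of the DPO step named in upgrade (3), the sketch below computes the standard DPO loss for a single preference pair from summed sequence log-probabilities. The function name, argument names, and the default `beta` are assumptions made for this example; it shows the generic DPO objective, not Tarsier2's actual training code or hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin: positive when the policy prefers the chosen response
    # more strongly than the reference model does.
    margin = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one pair of video descriptions.
loss = dpo_loss(policy_chosen_logp=-12.0, policy_rejected_logp=-15.0,
                ref_chosen_logp=-13.0, ref_rejected_logp=-14.0)
print(loss)
```

In this formulation the loss shrinks as the policy assigns relatively more probability to the preferred description than the reference model does, which is the signal that automatically constructed preference pairs would provide.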

Code & Models

Repositories

bytedance/tarsier (PyTorch, official)

Datasets

omni-research/Tarsier2-Recap-585K (dataset, 8.4k downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Direct Preference Optimization