How Important are Videos for Training Video LLMs?

George Lydakis; Alexander Hermans; Ali Athar; Daan de Geus; Bastian Leibe

arXiv:2506.06928·cs.CV·June 10, 2025

Open Access

TL;DR

Video LLMs show more temporal reasoning ability after image-only training than one would expect, and video-specific training adds surprisingly little, which suggests that current training methods may not fully exploit the temporal information in videos.

Contribution

The paper shows that image-trained Video LLMs perform significantly above chance on temporal reasoning and introduces a simple finetuning scheme, built on sequences of annotated images and temporally targeted questions, that rivals video-trained models.

Findings

01. Image-trained versions of two LLMs trained with the LongVU algorithm perform significantly above chance on TVBench, a temporal reasoning benchmark.

02. Video-specific training yields surprisingly small improvements over image-only training.

03. A simple finetuning scheme achieves comparable or better results than video training.

Abstract

Research into Video Large Language Models (LLMs) has progressed rapidly, with numerous models and benchmarks emerging in just a few years. Typically, these models are initialized with a pretrained text-only LLM and finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume, and that improvements from video-specific training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recent LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Additionally, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline results in temporal reasoning performance close to, and…
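The abstract describes the finetuning scheme only as "sequences of annotated images and questions targeting temporal capabilities". As a rough illustration of what one such training sample could look like, here is a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `AnnotatedImage` type, the `make_order_question` helper, the "which appears first" question template, and the requirement that labels be distinct.

```python
import random
from dataclasses import dataclass


@dataclass
class AnnotatedImage:
    # Hypothetical container: an image path plus a single text label.
    path: str
    label: str


def make_order_question(images, rng):
    """Build one pseudo-video sample with a temporal-order question.

    Assumption (not from the paper): images are shuffled into a frame
    sequence, and the question asks which of two labelled items appears
    earlier in that sequence. Labels are assumed to be distinct.
    """
    if len(images) < 2:
        raise ValueError("need at least two annotated images")

    frames = list(images)
    rng.shuffle(frames)  # the shuffled order plays the role of time

    # Pick two distinct frame positions; the smaller index is "earlier".
    i, j = rng.sample(range(len(frames)), 2)
    options = [frames[i].label, frames[j].label]
    answer = frames[min(i, j)].label

    return {
        "frames": [f.path for f in frames],
        "question": (
            f"Which appears first in the sequence: "
            f"{options[0]} or {options[1]}?"
        ),
        "options": options,
        "answer": answer,
    }


if __name__ == "__main__":
    images = [
        AnnotatedImage("img_001.jpg", "cat"),
        AnnotatedImage("img_002.jpg", "dog"),
        AnnotatedImage("img_003.jpg", "car"),
    ]
    sample = make_order_question(images, random.Random(0))
    print(sample["question"], "->", sample["answer"])
```

A question of this kind can only be answered by attending to frame order, which is the property a temporally targeted sample needs; the actual question types and annotations used in the paper may differ.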

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning