How Important are Videos for Training Video LLMs?
George Lydakis, Alexander Hermans, Ali Athar, Daan de Geus, Bastian Leibe

TL;DR
Video LLMs show more temporal reasoning ability after image-only training than one would assume, and video-specific training adds surprisingly little, suggesting that current training methods may not fully exploit the temporal information in videos.
Contribution
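The paper shows that image-trained versions of two LLMs trained with the LongVU algorithm perform significantly above chance on the TVBench temporal reasoning benchmark. It also introduces a simple finetuning scheme, built on sequences of annotated images and questions targeting temporal capabilities, that rivals video-trained models.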
Findings
Image-trained LLMs perform significantly above chance on TVBench, a temporal reasoning benchmark.
Video-specific training yields surprisingly small improvements.
A simple finetuning scheme achieves results comparable to or better than those of video training.
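As a rough illustration of what "above chance" means on a multiple-choice benchmark, the sketch below computes the chance-level accuracy from the number of answer options per question and a one-sided exact binomial p-value for an observed score. The function names and all numbers are hypothetical placeholders, not results or code from the paper, and using a single chance probability in the binomial test is only an approximation when questions have different numbers of options.

```python
from math import comb


def chance_accuracy(num_options_per_question):
    """Expected accuracy of uniform random guessing.

    num_options_per_question: list with the number of answer options
    for each question (hypothetical input format).
    """
    return sum(1.0 / k for k in num_options_per_question) / len(num_options_per_question)


def binomial_p_value(num_correct, num_questions, p_chance):
    """One-sided P(X >= num_correct) for X ~ Binomial(num_questions, p_chance)."""
    return sum(
        comb(num_questions, k) * p_chance**k * (1 - p_chance) ** (num_questions - k)
        for k in range(num_correct, num_questions + 1)
    )


if __name__ == "__main__":
    # Made-up example: 200 questions with 4 options each, 70 answered correctly.
    options = [4] * 200
    p_chance = chance_accuracy(options)
    p_value = binomial_p_value(70, len(options), p_chance)
    print(f"chance accuracy: {p_chance:.3f}")
    print(f"observed accuracy: {70 / len(options):.3f}")
    print(f"one-sided p-value vs. chance: {p_value:.2e}")
```

A small p-value here only says that the score is unlikely under random guessing; it does not by itself show that a model is using temporal information rather than other cues.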
Abstract
Research into Video Large Language Models (LLMs) has progressed rapidly, with numerous models and benchmarks emerging in just a few years. Typically, these models are initialized with a pretrained text-only LLM and finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume, and that improvements from video-specific training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recent LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Additionally, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline results in temporal reasoning performance close to, and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
