Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Linghao Zhang; Jungang Li; Yonghua Hei; Sicheng Tao; Song Dai; Yibo Yan; Zihao Dongfang; Weiting Liu; Chenxi Qin; Hanqian Li; Xin Zou; Jiahao Zhang; Shuhang Xun; Haiyun Jiang; Xuming Hu

arXiv:2603.17541 · cs.CV · March 19, 2026

Open Access

TL;DR

This paper investigates how video-based fine-tuning affects multimodal large language models, revealing a trade-off between improved video understanding and degraded static image performance, and proposing an adaptive frame sampling strategy.

Contribution

The paper systematically analyzes how Video-SFT affects visual capabilities in MLLMs, highlights the resulting spatial-temporal trade-off, and introduces an instruction-aware hybrid sampling method.

Findings

01 Video-SFT reliably improves video performance but often yields limited gains or even degradation on static image benchmarks.

02 Increasing the number of sampled frames generally improves video understanding but does not reliably improve static image performance.

03 Adaptive frame allocation partially mitigates the image-video trade-off.
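
To make the notion of a frame budget in the findings above concrete, here is a minimal Python sketch. It is not the paper's method: `uniform_frame_indices`, `choose_budget`, the `TEMPORAL_CUES` keyword list, and the budgets 4 and 32 are all hypothetical names and values chosen for illustration. The sketch samples frames evenly under a fixed budget and picks a larger budget only when the instruction appears to ask about temporal order.

```python
import re

# Hypothetical cue words; the paper does not specify how instructions are analyzed.
TEMPORAL_CUES = {"before", "after", "then", "while", "during",
                 "sequence", "order", "motion", "moving"}


def uniform_frame_indices(total_frames: int, budget: int) -> list[int]:
    """Return up to `budget` frame indices spread evenly over the clip.

    Each index is the midpoint of one of `budget` equal-length segments.
    """
    if total_frames <= 0 or budget <= 0:
        return []
    budget = min(budget, total_frames)  # never request more frames than exist
    step = total_frames / budget
    return [int(step * i + step / 2) for i in range(budget)]


def choose_budget(instruction: str, low: int = 4, high: int = 32) -> int:
    """Assumed heuristic: use the high budget if the instruction has a temporal cue."""
    words = set(re.findall(r"[a-z]+", instruction.lower()))
    return high if words & TEMPORAL_CUES else low


if __name__ == "__main__":
    question = "What does the person do after opening the door?"
    budget = choose_budget(question)
    print(budget, uniform_frame_indices(total_frames=300, budget=budget)[:4])
```

A keyword test is only a stand-in for whatever instruction-aware signal a real system would use; the point is that the frame budget becomes a per-query decision rather than a global constant.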

Abstract

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning