MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Wenyi Hong; Yean Cheng; Zhuoyi Yang; Weihan Wang; Lefan Wang; Xiaotao Gu; Shiyu Huang; Yuxiao Dong; Jie Tang

arXiv:2501.02955 · cs.CV · May 13, 2026

TL;DR

MotionBench is a benchmark for evaluating how well vision language models (VLMs) understand fine-grained motion in videos. It shows that existing models perform poorly on this task, and the accompanying paper proposes a method for improving their motion perception.

Contribution

The paper introduces MotionBench, a comprehensive benchmark for fine-grained motion understanding, and proposes Through-Encoder Fusion (TE Fusion), a method for improving VLMs' motion perception.

Findings

01 Existing VLMs perform poorly in fine-grained motion understanding.

02 Higher frame rates and TE Fusion improve motion perception.

03 There is significant room for improvement in current models.

Abstract

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM's ability to perceive fine-grained motion within a limited sequence length of LLM, we conduct extensive experiments reviewing VLM architectures optimized for video…

Code & Models

Repositories

https://motion-bench.github.io

Videos

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models (SlidesLive)