MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

Henghui Ding; Chang Liu; Shuting He; Kaining Ying; Xudong Jiang; Chen Change Loy; Yu-Gang Jiang

arXiv:2512.10945 · cs.CV · December 13, 2025

TL;DR

MeViS is a large-scale multi-modal dataset for referring motion expression video segmentation that emphasizes motion understanding in both video and language. Benchmarks of existing methods on the dataset reveal their limitations on this task.

Contribution

The paper presents MeViS, a dataset of 33,072 human-annotated motion expressions (in text and audio) covering 8,171 objects in 2,006 videos, and proposes LMPM++, a method that achieves state-of-the-art results in motion-guided video understanding tasks.

Findings

01

Existing methods benchmarked on MeViS show limitations in understanding motion expressions.

02

LMPM++ outperforms previous approaches on multiple tasks.

03

The dataset enables new research in motion-based video analysis.

Abstract

This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language descriptions of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both video and language. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems