MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang

TL;DR
MeViS is a large-scale multi-modal dataset for referring motion expression video segmentation that emphasizes motion understanding in both video and language. The paper benchmarks existing methods on it, revealing their limitations in this domain.
Contribution
The paper presents MeViS, a dataset of 33,072 human-annotated motion expressions covering 8,171 objects in 2,006 videos, and proposes LMPM++, a method that achieves state-of-the-art results in motion-guided video understanding tasks.
Findings
The 15 existing methods benchmarked across the 4 tasks supported by MeViS show limitations in motion expression understanding.
LMPM++ outperforms previous approaches on multiple tasks.
The dataset enables new research in motion-based video analysis.
Abstract
This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language descriptions of the objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both video and language. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems
