Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding

Chang Liu; Henghui Ding; Nikhila Ravi; Yunchao Wei; Shuting He; Song Bai; Philip Torr; Leilei Cao; Jinrong Zhang; Deshui Miao; Xusheng He; Dengxian Gong; Zhiyu Wang; Mingqi Gao; Jihwan Hong; Canyang Wu; Weili Guan; Jianlong Wu; Liqiang Nie; Xingsen Huang; Yameng Gu; Xiaogang Yu; Xin Li; Ming-Hsuan Yang; Sijie Li; Jungong Han; Quanzhu Niu; Shihao Chen; Yuanzheng Wu; Yikang Zhou; Tao Zhang; Haobo Yuan; Lu Qi; Shunping Ji; Chao Yang; Chao Tian; Guoqing Zhu; Kai Yang; Zhifan Mo; Haijun Zhang; Xudong Kang; Shutao Li; Jaeyoung Do

arXiv:2604.26031 · cs.CV · April 30, 2026

TL;DR

This report summarizes the objectives, datasets, and top methodologies of the 2026 PVUW Challenge, which evaluates multimodal pixel-level video understanding in highly unconstrained environments across three specialized tracks.

Contribution

The report presents previously unreleased challenging data, describes three tracks (MOSE, MeViS-Text, and the newly inaugurated audio-driven MeViS-Audio), and analyzes the top multimodal solutions submitted by participants.

Findings

01

Introduction of challenging new datasets for pixel-level understanding.

02

Organization of three specialized tracks, including the newly inaugurated audio-driven MeViS-Audio track.

03

Analysis of cutting-edge multimodal solutions and future directions.

Abstract

This report summarizes the objectives, datasets, and top-performing methodologies of the 2026 Pixel-level Video Understanding in the Wild (PVUW) Challenge, hosted at CVPR 2026, which evaluates state-of-the-art models under highly unconstrained conditions. To provide a comprehensive assessment, the 2026 edition features three specialized tracks: the MOSE track for tracking objects within densely cluttered and severely occluded scenarios; the MeViS-Text track for localizing targets via motion-focused linguistic expressions; and the newly inaugurated MeViS-Audio track, which pioneers acoustic-driven object segmentation. By introducing previously unreleased challenging data and analyzing the cutting-edge, multimodal solutions submitted by participants, this report highlights the community's latest technical advancements and charts promising future directions for robust video scene…
