Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding
Chang Liu, Henghui Ding, Nikhila Ravi, Yunchao Wei, Shuting He, Song Bai, Philip Torr, Leilei Cao, Jinrong Zhang, Deshui Miao, Xusheng He, Dengxian Gong, Zhiyu Wang, Mingqi Gao, Jihwan Hong, Canyang Wu, Weili Guan, Jianlong Wu, Liqiang Nie, Xingsen Huang, Yameng Gu, Xiaogang Yu

TL;DR
This report summarizes the objectives, datasets, and top methodologies of the 2026 PVUW Challenge, which evaluates multimodal pixel-level video understanding in highly unconstrained environments across three specialized tracks.
Contribution
It introduces new challenging datasets, organizes three tracks including a newly inaugurated audio-driven segmentation track, and provides a comprehensive analysis of state-of-the-art multimodal solutions.
Findings
Introduction of challenging new datasets for pixel-level understanding.
Development of three specialized tracks including audio-driven segmentation.
Analysis of cutting-edge multimodal solutions and future directions.
Abstract
This report summarizes the objectives, datasets, and top-performing methodologies of the 2026 Pixel-level Video Understanding in the Wild (PVUW) Challenge, hosted at CVPR 2026, which evaluates state-of-the-art models under highly unconstrained conditions. To provide a comprehensive assessment, the 2026 edition features three specialized tracks: the MOSE track for tracking objects within densely cluttered and severely occluded scenarios; the MeViS-Text track for localizing targets via motion-focused linguistic expressions; and the newly inaugurated MeViS-Audio track, which pioneers acoustic-driven object segmentation. By introducing previously unreleased challenging data and analyzing the cutting-edge, multimodal solutions submitted by participants, this report highlights the community's latest technical advancements and charts promising future directions for robust video scene…
