SyncVIS: Synchronized Video Instance Segmentation
Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao

TL;DR
SyncVIS introduces a synchronized video instance segmentation framework that explicitly models video-level and frame-level queries, improving performance on challenging benchmarks by promoting mutual learning and easier optimization.
Contribution
The paper proposes a novel synchronized modeling framework for VIS that explicitly incorporates video-level queries and synchronization modules, addressing limitations of asynchronous designs.
Findings
Achieves state-of-the-art results on YouTube-VIS and OVIS benchmarks.
Demonstrates the effectiveness of synchronized query modeling.
Validates generality across multiple challenging datasets.
Abstract
Recent DETR-based methods have advanced Video Instance Segmentation (VIS) through transformers' efficiency and capability in modeling spatial and temporal information. Despite remarkable progress, existing works follow asynchronous designs, which model video sequences either via video-level queries only or via query-sensitive cascade structures, resulting in difficulties when handling complex and challenging video scenarios. In this work, we analyze the cause of this phenomenon and the limitations of the current solutions, and propose to conduct synchronized modeling via a new framework named SyncVIS. Specifically, SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level query with frame-level query embeddings: a synchronized video-frame modeling paradigm and a synchronized embedding optimization…
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection
