RefineVIS: Video Instance Segmentation with Temporal Attention Refinement
Andre Abrantes, Jiang Wang, Peng Chu, Quanzeng You, Zicheng Liu

TL;DR
RefineVIS is a new video instance segmentation framework that iteratively refines object association and segmentation masks using temporal attention and contrastive learning, achieving state-of-the-art results.
Contribution
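RefineVIS introduces a dual-representation approach: an association representation that links objects across frames and a segmentation representation that produces the masks. Contrastive learning trains the association representation, and temporal attention refinement improves the segmentation representation.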
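To make the idea of refining a representation with sequence context concrete, here is a minimal sketch of self-attention along the time axis for a single tracked instance. It is not the paper's TAR module, whose design is not described in this summary. The function name `temporal_attention_refine`, the parameter-free dot-product attention, and the residual connection are all assumptions made for illustration.

```python
import numpy as np

def temporal_attention_refine(track_emb):
    """Refine per-frame embeddings of ONE instance using the other frames.

    track_emb: array of shape (T, D), one D-dim embedding per frame.
    Returns an array of shape (T, D).
    Assumption: plain scaled dot-product self-attention with no learned
    projections; a real module would learn query/key/value weights.
    """
    _, dim = track_emb.shape
    # Pairwise frame-to-frame similarity, scaled as in standard attention.
    scores = track_emb @ track_emb.T / np.sqrt(dim)
    # Numerically stable softmax over the time axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Residual connection: keep the frame's own embedding, add temporal context.
    return track_emb + weights @ track_emb

# Hypothetical track: 5 frames, 16-dim embeddings.
rng = np.random.default_rng(0)
track = rng.normal(size=(5, 16))
refined = temporal_attention_refine(track)
```

Each output row mixes information from every frame of the track, so a frame where the object is occluded or blurred can borrow evidence from cleaner frames.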
Findings
Achieves 64.4 AP on YouTube-VIS 2019
Achieves 61.4 AP on YouTube-VIS 2021
Achieves 46.1 AP on the OVIS dataset
Abstract
We introduce a novel framework called RefineVIS for Video Instance Segmentation (VIS) that achieves good object association between frames and accurate segmentation masks by iteratively refining the representations using sequence context. RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model: an association representation responsible for associating objects across frames and a segmentation representation that produces accurate segmentation masks. Contrastive learning is utilized to learn temporally stable association representations. A Temporal Attention Refinement (TAR) module learns discriminative segmentation representations by exploiting temporal relationships and a novel temporal contrastive denoising technique. Our method supports both online and offline inference. It achieves state-of-the-art video instance…
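The abstract's contrastive learning of association representations can be illustrated with a generic InfoNCE-style loss, where embeddings of the same object in two frames form positive pairs and embeddings of other objects act as negatives. This is a sketch under stated assumptions, not the authors' loss: the function names `info_nce_loss` and `associate`, the temperature value, and the greedy cosine-similarity matching are hypothetical choices made for illustration.

```python
import numpy as np

def _l2_normalize(x):
    # Unit-normalize each row so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchors[i] and positives[i] embed the same object
    in two different frames; every other row serves as a negative.
    Both inputs have shape (N, D)."""
    a = _l2_normalize(anchors)
    p = _l2_normalize(positives)
    logits = a @ p.T / temperature                       # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stability shift
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct match for row i is column i.
    return float(-np.mean(np.diag(log_prob)))

def associate(prev_emb, curr_emb):
    """Assumed greedy matching: for each current-frame object, return the
    index of the most similar previous-frame object."""
    sim = _l2_normalize(curr_emb) @ _l2_normalize(prev_emb).T
    return sim.argmax(axis=1)

# Hypothetical data: 4 objects with well-separated 8-dim embeddings.
rng = np.random.default_rng(0)
frame_a = np.eye(4, 8)
frame_b = frame_a + 0.01 * rng.normal(size=frame_a.shape)

loss_matched = info_nce_loss(frame_a, frame_b)
loss_shuffled = info_nce_loss(frame_a, np.roll(frame_b, 1, axis=0))
matches = associate(frame_a, frame_b)
```

The loss is low when each object's embedding stays close to itself across frames and far from other objects, which is the property that makes nearest-neighbor association across frames reliable.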
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection · Image Enhancement Techniques
Methods: Contrastive Learning
