One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out
Minghan Li, Lei Zhang

TL;DR
This paper introduces a clip-based approach to video instance segmentation that leverages temporal coherence within short clips, improving accuracy and efficiency over traditional frame-by-frame methods.
Contribution
It proposes a clip-in clip-out (CiCo) framework that replaces the 2D convolutions in the prediction heads and mask branch with 3D convolutions, so that features are integrated across the frames of a short clip, and it reports state-of-the-art results.
Findings
Achieves new state-of-the-art mask AP on multiple datasets.
Demonstrates effectiveness of clip-based segmentation over frame-based methods.
Easily integrates into existing VIS frameworks.
Abstract
Many video instance segmentation (VIS) methods partition a video sequence into individual frames to detect and segment objects frame by frame. However, such a frame-in frame-out (FiFo) pipeline is ineffective at exploiting temporal information. Based on the fact that adjacent frames in a short clip are highly coherent in content, we propose to extend the one-stage FiFo framework to a clip-in clip-out (CiCo) one, which performs VIS clip by clip. Specifically, we stack FPN features of all frames in a short video clip to build a spatio-temporal feature cube, and replace the 2D conv layers in the prediction heads and the mask branch with 3D conv layers, forming clip-level prediction heads (CPH) and clip-level mask heads (CMH). Then the clip-level masks of an instance can be generated by feeding its box-level predictions from CPH and clip-level features from CMH into a small fully…
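The stacking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `stack_clip_features` and `conv3d_same` are hypothetical, the clip length, channel counts, and 3x3x3 kernel are arbitrary illustrative choices, only a single feature level is shown, and a naive loop stands in for the 3D conv layer that a deep learning framework would provide.

```python
import numpy as np

def stack_clip_features(frame_feats):
    """Stack per-frame feature maps (each C x H x W) into a C x T x H x W cube."""
    return np.stack(frame_feats, axis=1)

def conv3d_same(cube, weight):
    """Naive 3D cross-correlation with zero padding and stride 1.

    cube:   (C_in, T, H, W)
    weight: (C_out, C_in, kT, kH, kW), odd kernel sizes assumed
    returns (C_out, T, H, W)
    """
    c_out, _, kt, kh, kw = weight.shape
    _, T, H, W = cube.shape
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    padded = np.pad(cube, ((0, 0), (pt, pt), (ph, ph), (pw, pw)))
    out = np.zeros((c_out, T, H, W), dtype=cube.dtype)
    # Accumulate one kernel offset at a time; each offset mixes input channels.
    for dt in range(kt):
        for dh in range(kh):
            for dw in range(kw):
                window = padded[:, dt:dt + T, dh:dh + H, dw:dw + W]
                out += np.einsum("oc,cthw->othw", weight[:, :, dt, dh, dw], window)
    return out

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
T, C, H, W = 3, 4, 8, 8
frames = [rng.standard_normal((C, H, W)) for _ in range(T)]

cube = stack_clip_features(frames)             # shape (4, 3, 8, 8)
weight = rng.standard_normal((2, C, 3, 3, 3))  # 2 output channels
clip_out = conv3d_same(cube, weight)           # shape (2, 3, 8, 8)
print(cube.shape, clip_out.shape)
```

With a temporal kernel size of 3, the output at each frame depends on that frame and its immediate neighbours in the clip, which is the sense in which a 3D head sees temporal context that a per-frame 2D head does not. If only the central temporal slice of the kernel is non-zero, the operation reduces to an independent 2D convolution per frame, that is, the frame-in frame-out case.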
Taxonomy
Topics: Video Analysis and Summarization · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Feature Pyramid Network · Dropout · Convolution · Dense Connections · Residual Connection · Layer Normalization · 1x1 Convolution
