One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out

Minghan Li; Lei Zhang

arXiv:2203.06421·cs.CV·March 15, 2022

TL;DR

This paper introduces a clip-based approach to video instance segmentation that leverages temporal coherence within short clips, improving accuracy and efficiency over traditional frame-by-frame methods.

Contribution

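The paper proposes a clip-in clip-out (CiCo) framework that replaces the 2D convolutions in the prediction heads and the mask branch of a one-stage VIS model with 3D convolutions, so that features are integrated across the frames of a clip rather than within each frame alone.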

Findings

01. Achieves new state-of-the-art mask AP on multiple datasets.

02. Demonstrates the effectiveness of clip-based segmentation over frame-based methods.

03. Integrates easily into existing VIS frameworks.

Abstract

Many video instance segmentation (VIS) methods partition a video sequence into individual frames and detect and segment objects frame by frame. However, such a frame-in frame-out (FiFo) pipeline is ineffective at exploiting temporal information. Because adjacent frames in a short clip are highly coherent in content, we propose to extend the one-stage FiFo framework to a clip-in clip-out (CiCo) one, which performs VIS clip by clip. Specifically, we stack the FPN features of all frames in a short video clip to build a spatio-temporal feature cube, and replace the 2D conv layers in the prediction heads and the mask branch with 3D conv layers, forming clip-level prediction heads (CPH) and clip-level mask heads (CMH). The clip-level masks of an instance can then be generated by feeding its box-level predictions from CPH and clip-level features from CMH into a small fully…
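The clip-level idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it uses a single feature channel, plain nested lists instead of FPN tensors, and non-overlapping clips, all of which are simplifying assumptions. The helper names (`split_into_clips`, `stack_frames`, `conv3d_same`) are hypothetical. It only shows how per-frame feature maps might be stacked into a T x H x W cube and how a 3D kernel then mixes information across neighbouring frames, which a 2D kernel applied to each frame cannot do.

```python
from itertools import product


def split_into_clips(num_frames, clip_len):
    """Partition frame indices into consecutive, non-overlapping clips (assumption)."""
    return [list(range(start, min(start + clip_len, num_frames)))
            for start in range(0, num_frames, clip_len)]


def stack_frames(frame_feats):
    """Stack T per-frame H x W feature maps into one T x H x W cube (single channel)."""
    return [[row[:] for row in feat] for feat in frame_feats]


def conv3d_same(cube, kernel):
    """Naive single-channel 3D cross-correlation with zero padding ('same' output size)."""
    T, H, W = len(cube), len(cube[0]), len(cube[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t, y, x in product(range(T), range(H), range(W)):
        acc = 0.0
        for dt, dy, dx in product(range(kt), range(kh), range(kw)):
            tt, yy, xx = t + dt - pt, y + dy - ph, x + dx - pw
            # Positions outside the cube count as zero (zero padding).
            if 0 <= tt < T and 0 <= yy < H and 0 <= xx < W:
                acc += kernel[dt][dy][dx] * cube[tt][yy][xx]
        out[t][y][x] = acc
    return out


# Toy clip: three 2 x 2 frames whose features are constant 1.0, 2.0 and 3.0.
frames = [[[float(t + 1)] * 2 for _ in range(2)] for t in range(3)]
cube = stack_frames(frames)

# A 3 x 1 x 1 kernel of ones sums each pixel over its temporal neighbours.
temporal_kernel = [[[1.0]], [[1.0]], [[1.0]]]
mixed = conv3d_same(cube, temporal_kernel)

print(split_into_clips(7, 3))
print(mixed[1][0][0])
```

In this toy setup the middle frame's output depends on all three frames of the clip, while the first and last frames see only one neighbour because of the zero padding. A real model would use multi-channel learned kernels with spatial extent as well.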

Taxonomy

Topics: Video Analysis and Summarization · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Feature Pyramid Network · Dropout · Convolution · Dense Connections · Residual Connection · Layer Normalization · 1x1 Convolution