Video Instance Segmentation using Inter-Frame Communication Transformers
Sukjun Hwang, Miran Heo, Seoung Wug Oh, Seon Joo Kim

TL;DR
This paper introduces Inter-frame Communication Transformers (IFC), a new efficient transformer-based approach for video instance segmentation that achieves state-of-the-art accuracy with high speed and low computational overhead.
Contribution
The paper presents IFC, a novel transformer architecture with memory tokens that efficiently encodes inter-frame context for improved video instance segmentation.
Findings
Achieved state-of-the-art AP 44.6 on YouTube-VIS 2019 val set.
Processed videos at 89.4 FPS, enabling real-time inference.
Reduced computational overhead compared to previous per-clip models.
Abstract
We propose a novel end-to-end solution for video instance segmentation (VIS) based on transformers. Recently, the per-clip pipeline has shown superior performance over per-frame methods by leveraging richer information from multiple frames. However, previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communication, which limits their practicality. In this work, we propose Inter-frame Communication Transformers (IFC), which significantly reduces the overhead of passing information between frames by efficiently encoding the context within the input clip. Specifically, we propose to utilize concise memory tokens as a means of conveying information as well as summarizing each frame scene. The features of each frame are enriched and correlated with other frames through the exchange of information between the precisely encoded memory tokens. We validate our method on the…
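The memory-token exchange described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two-step split (per-frame attention, then attention among memory tokens only), the function names `self_attention`, `ifc_style_layer`, and `attention_pairs`, and the omission of learned projections, multiple heads, and normalization are all simplifying assumptions made here for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with no learned weights (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def ifc_style_layer(frames, memory):
    """One hypothetical layer.

    frames: T lists, each holding N feature vectors of one frame.
    memory: T lists, each holding M memory tokens for that frame.
    """
    n = len(frames[0])
    m = len(memory[0])
    new_frames, new_memory = [], []
    # Step 1: inside each frame, features and that frame's memory tokens attend jointly.
    for feats, mem in zip(frames, memory):
        mixed = self_attention(feats + mem)
        new_frames.append(mixed[:n])
        new_memory.append(mixed[n:])
    # Step 2: only the memory tokens of all frames attend to each other.
    flat = self_attention([tok for mem in new_memory for tok in mem])
    new_memory = [flat[t * m:(t + 1) * m] for t in range(len(frames))]
    return new_frames, new_memory


def attention_pairs(t, n, m):
    """Count query-key pairs: full clip attention vs. the two-step scheme."""
    full = (t * n) ** 2
    two_step = t * (n + m) ** 2 + (t * m) ** 2
    return full, two_step


random.seed(0)
T, N, M, D = 3, 4, 2, 8  # toy sizes, chosen arbitrarily
frames = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(T)]
memory = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(M)] for _ in range(T)]
frames, memory = ifc_style_layer(frames, memory)
print(attention_pairs(5, 100, 8))  # (250000, 59920)
```

In this sketch a single layer lets each frame's memory tokens see the other frames, while the frame features themselves only pick up that cross-frame context in the next layer, when they attend to the updated memory tokens. With the assumed sizes of 5 frames, 100 feature tokens per frame, and 8 memory tokens per frame, the two-step scheme scores 59,920 query-key pairs against 250,000 for attention over the whole clip, which illustrates where the reduced overhead could come from.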
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Video Surveillance and Tracking Methods
