Fast Video Object Segmentation via Mask Transfer Network
Tao Zhuo, Zhiyong Cheng, Mohan Kankanhalli

TL;DR
This paper introduces a fast and efficient mask transfer network for video object segmentation that eliminates the need for fine-tuning and achieves real-time processing speeds while maintaining competitive accuracy.
Contribution
The proposed Mask Transfer Network (MTN) significantly improves VOS speed by using global pixel matching on downsampled features without fine-tuning or relying on temporal cues.
Findings
Achieves 37 fps on DAVIS datasets.
Maintains competitive accuracy with state-of-the-art methods.
Does not require fine-tuning or object category information.
Abstract
Accuracy and processing speed are two important factors that affect the use of video object segmentation (VOS) in real applications. Advances in deep neural networks have significantly improved accuracy, but speed remains far below real-time requirements because of complicated network designs, such as the need for a first-frame fine-tuning step. To overcome this limitation, we propose a novel mask transfer network (MTN), which greatly boosts the processing speed of VOS while still achieving reasonable accuracy. The basic idea of MTN is to transfer the reference mask to the target frame via an efficient global pixel matching strategy. Matching is performed globally between the reference frame and the target frame to ensure good matching results. To enhance the matching speed, we perform the matching on a downsampled feature map with 1/32 of…
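The mask-transfer idea in the abstract can be illustrated with a minimal sketch. This is not the authors' network: it assumes per-pixel feature vectors are already available on a coarse grid, and it swaps in plain cosine similarity with a softmax over all reference pixels as the matching rule. The function name `transfer_mask` and the `temperature` parameter are hypothetical, chosen only for illustration. A practical version would use a tensor library rather than Python lists.

```python
import math


def _normalize(vec):
    # Scale a feature vector to unit length (zero vectors are left as-is).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def transfer_mask(ref_feats, tgt_feats, ref_mask, temperature=0.05):
    """Transfer a reference mask to a target frame by global pixel matching.

    ref_feats, tgt_feats: lists of per-pixel feature vectors (flattened grids).
    ref_mask: foreground probability for each reference pixel.
    Assumption: cosine-similarity softmax stands in for the learned matching.
    """
    ref = [_normalize(v) for v in ref_feats]
    out = []
    for vec in tgt_feats:
        t = _normalize(vec)
        # Compare this target pixel against every reference pixel (global).
        sims = [sum(a * b for a, b in zip(t, r)) for r in ref]
        top = max(sims)
        # Numerically stable softmax over reference pixels.
        weights = [math.exp((s - top) / temperature) for s in sims]
        total = sum(weights)
        # Target label is the similarity-weighted average of reference labels.
        out.append(sum(w * m for w, m in zip(weights, ref_mask)) / total)
    return out


# Toy example: six pixels with one-hot features; the target frame is the
# reference shifted by one pixel, so the mask should shift with it.
feats = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
target = feats[1:] + feats[:1]
mask = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print([round(p) for p in transfer_mask(feats, target, mask)])
# → [1, 0, 0, 0, 0, 1]
```

Because every target pixel is compared with every reference pixel, the cost grows with the product of the two pixel counts, which is why running the matching on a heavily downsampled feature map matters for speed.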
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Video Surveillance and Tracking Methods
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
