# Meta Learning Deep Visual Words for Fast Video Object Segmentation

**Authors:** Harkirat Singh Behl, Mohammad Najafi, Anurag Arnab, Philip H.S. Torr

arXiv: 1812.01397 · 2020-08-18

## TL;DR

This paper introduces a fast, finetuning-free method for video object segmentation. It represents each object with meta-learned visual words that correspond to object parts, which keeps matching robust under occlusion and deformation and lets it segment a variable number of objects in a single forward pass.

## Contribution

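The paper learns visual words for object segmentation in an unsupervised manner, using meta-learning so that the training objective matches the inference procedure. The resulting method needs no finetuning, no auxiliary inputs such as optical flow, and no post-processing.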

## Key findings

- Achieves accuracy comparable to finetuning-based methods
- Runs 1 to 2 orders of magnitude (roughly 10 to 100 times) faster than finetuning-based methods
- Achieves state-of-the-art speed/accuracy trade-offs on four video segmentation datasets

## Abstract

Personal robots and driverless cars need to be able to operate in novel environments and thus quickly and efficiently learn to recognise new object classes. We address this problem by considering the task of video object segmentation. Previous accurate methods for this task finetune a model using the first annotated frame, and/or use additional inputs such as optical flow and complex post-processing. In contrast, we develop a fast, causal algorithm that requires no finetuning, auxiliary inputs or post-processing, and segments a variable number of objects in a single forward-pass. We represent an object with clusters, or "visual words", in the embedding space, which correspond to object parts in the image space. This allows us to robustly match to the reference objects throughout the video, because although the global appearance of an object changes as it undergoes occlusions and deformations, the appearance of more local parts may stay consistent. We learn these visual words in an unsupervised manner, using meta-learning to ensure that our training objective matches our inference procedure. We achieve comparable accuracy to finetuning based methods (whilst being 1 to 2 orders of magnitude faster), and state-of-the-art in terms of speed/accuracy trade-offs on four video segmentation datasets. Code is available at https://github.com/harkiratbehl/MetaVOS.
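The matching idea in the abstract (cluster each reference object's pixel embeddings into visual words, then label later pixels by their nearest word) can be illustrated with a minimal sketch. This is not the authors' implementation: plain k-means, Euclidean distance, the function names, and the `words_per_object` parameter are all assumptions made for illustration. The embedding network and the meta-learning training loop are omitted, so embeddings are taken as given arrays.

```python
import numpy as np


def kmeans(points, k, iters=10, seed=0):
    """Plain k-means returning (k, d) centroids (an assumed stand-in clustering)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def build_visual_words(ref_embeddings, ref_labels, words_per_object=4):
    """Cluster the reference-frame embeddings of each object into visual words."""
    words, word_labels = [], []
    for obj in np.unique(ref_labels):
        pts = ref_embeddings[ref_labels == obj]
        k = min(words_per_object, len(pts))
        words.append(kmeans(pts, k))
        word_labels.append(np.full(k, obj))
    return np.concatenate(words), np.concatenate(word_labels)


def segment(query_embeddings, words, word_labels):
    """Label each query pixel with the object that owns its nearest visual word."""
    dists = np.linalg.norm(query_embeddings[:, None, :] - words[None, :, :], axis=2)
    return word_labels[dists.argmin(axis=1)]


if __name__ == "__main__":
    # Synthetic 2-D "embeddings": each object has two well-separated parts.
    rng = np.random.default_rng(0)
    centers = np.array([[0.0, 0.0], [0.0, 10.0], [30.0, 0.0], [30.0, 10.0]])
    labels = np.repeat([0, 0, 1, 1], 50)
    ref = np.repeat(centers, 50, axis=0) + 0.1 * rng.standard_normal((200, 2))
    query = np.repeat(centers, 50, axis=0) + 0.1 * rng.standard_normal((200, 2))

    words, word_labels = build_visual_words(ref, labels, words_per_object=2)
    pred = segment(query, words, word_labels)
    print("pixel accuracy:", (pred == labels).mean())
```

Because every object keeps several words instead of one global prototype, a query pixel only has to resemble one part of the object to be matched, which is the robustness argument the abstract makes for part-level representations.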

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.01397/full.md

## Figures

24 figures with captions in the complete paper: https://tomesphere.com/paper/1812.01397/full.md

## References

57 references — full list in the complete paper: https://tomesphere.com/paper/1812.01397/full.md

---
Source: https://tomesphere.com/paper/1812.01397