Towards Real-Time Open-Vocabulary Video Instance Segmentation

Bin Yan; Martin Sundermeyer; David Joseph Tan; Huchuan Lu; Federico Tombari

arXiv:2412.04434·cs.CV·December 6, 2024

TL;DR

This paper introduces TROY-VIS, a real-time open-vocabulary video instance segmentation method that significantly improves processing speed while maintaining high accuracy, enabling practical applications in dynamic environments.

Contribution

The paper proposes TROY-VIS, a novel method with three key techniques that achieves real-time OV-VIS with high accuracy, outperforming existing models in speed and efficiency.

Findings

01

TROY-VIS runs 20x faster than GLEE-Lite (25 FPS vs. 1.25 FPS).

02

It maintains high accuracy comparable to state-of-the-art methods.

03

It shows potential for real-time applications in robotics and AR.

Abstract

In this paper, we address the challenge of performing open-vocabulary video instance segmentation (OV-VIS) in real-time. We analyze the computational bottlenecks of state-of-the-art foundation models that perform OV-VIS, and propose a new method, TROY-VIS, that significantly improves processing speed while maintaining high accuracy. We introduce three key techniques: (1) Decoupled Attention Feature Enhancer to speed up information interaction between different modalities and scales; (2) Flash Embedding Memory for obtaining fast text embeddings of object categories; and (3) Kernel Interpolation for exploiting the temporal continuity in videos. Our experiments demonstrate that TROY-VIS achieves the best trade-off between accuracy and speed on two large-scale OV-VIS benchmarks, BURST and LV-VIS, running 20x faster than GLEE-Lite (25 FPS vs. 1.25 FPS) with comparable or even better…
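The abstract does not say how Flash Embedding Memory is built. One plausible reading of "obtaining fast text embeddings of object categories" is a lookup table that runs the text encoder once per category name and reuses the result on later frames. The Python sketch below illustrates only that general caching idea; the class `EmbeddingMemory`, its methods, and the stand-in `toy_encoder` are hypothetical names chosen for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence


class EmbeddingMemory:
    """Hypothetical cache from category names to text embeddings.

    Assumption: the expensive step is the text encoder, so each
    category name is encoded at most once and then looked up.
    """

    def __init__(self, encode: Callable[[str], List[float]]) -> None:
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.encoder_calls = 0  # counts cache misses, for inspection

    def get(self, name: str) -> List[float]:
        # Normalize the key so "Cat " and "cat" share one entry.
        key = name.strip().lower()
        if key not in self._store:
            self._store[key] = self._encode(key)
            self.encoder_calls += 1
        return self._store[key]

    def get_many(self, names: Sequence[str]) -> List[List[float]]:
        return [self.get(n) for n in names]


def toy_encoder(text: str) -> List[float]:
    # Stand-in for a real text encoder; returns a tiny deterministic vector.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


memory = EmbeddingMemory(toy_encoder)
embeddings = memory.get_many(["cat", "dog", "Cat "])
```

In this sketch the third query reuses the entry stored for "cat", so the encoder runs only twice for three requests. On a video with a fixed category vocabulary, the per-frame cost of text embedding would then reduce to dictionary lookups after the first frame.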

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings