Towards Real-Time Open-Vocabulary Video Instance Segmentation
Bin Yan, Martin Sundermeyer, David Joseph Tan, Huchuan Lu, Federico Tombari

TL;DR
This paper introduces TROY-VIS, a real-time open-vocabulary video instance segmentation method that significantly improves processing speed while maintaining high accuracy, enabling practical applications in dynamic environments.
Contribution
The paper proposes TROY-VIS, a method that combines three techniques (Decoupled Attention Feature Enhancer, Flash Embedding Memory, and Kernel Interpolation) to perform OV-VIS in real time, reaching the best accuracy-speed trade-off on the BURST and LV-VIS benchmarks.
Findings
TROY-VIS runs 20x faster than GLEE-Lite (25 FPS vs. 1.25 FPS).
It maintains accuracy comparable to state-of-the-art methods.
The method shows potential for real-time applications in robotics and AR.
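To make the reported speed figures concrete, the short sketch below converts throughput into per-frame latency and recomputes the speedup. The only inputs are the 25 FPS and 1.25 FPS figures quoted for TROY-VIS and GLEE-Lite; the function names are illustrative and not taken from the paper.

```python
def frame_latency_ms(fps: float) -> float:
    """Return the per-frame latency in milliseconds for a throughput in FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def speedup(fast_fps: float, slow_fps: float) -> float:
    """Return how many times faster the first throughput is than the second."""
    if slow_fps <= 0:
        raise ValueError("slow_fps must be positive")
    return fast_fps / slow_fps


TROY_VIS_FPS = 25.0   # figure quoted in the abstract
GLEE_LITE_FPS = 1.25  # figure quoted in the abstract

print(frame_latency_ms(TROY_VIS_FPS))          # 40.0 ms per frame
print(frame_latency_ms(GLEE_LITE_FPS))         # 800.0 ms per frame
print(speedup(TROY_VIS_FPS, GLEE_LITE_FPS))    # 20.0
```

In latency terms, the 20x speedup is the difference between roughly 40 ms and 800 ms per frame.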
Abstract
In this paper, we address the challenge of performing open-vocabulary video instance segmentation (OV-VIS) in real-time. We analyze the computational bottlenecks of state-of-the-art foundation models that perform OV-VIS, and propose a new method, TROY-VIS, that significantly improves processing speed while maintaining high accuracy. We introduce three key techniques: (1) Decoupled Attention Feature Enhancer to speed up information interaction between different modalities and scales; (2) Flash Embedding Memory for obtaining fast text embeddings of object categories; and (3) Kernel Interpolation for exploiting the temporal continuity in videos. Our experiments demonstrate that TROY-VIS achieves the best trade-off between accuracy and speed on two large-scale OV-VIS benchmarks, BURST and LV-VIS, running 20x faster than GLEE-Lite (25 FPS vs. 1.25 FPS) with comparable or even better…
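The abstract does not say how Flash Embedding Memory is built, so the following is not the authors' implementation. It is a minimal sketch of one general idea that fits the stated goal of fast text embeddings for object categories: cache each category's embedding so that repeated queries skip the text encoder. `toy_text_encoder` and `EmbeddingCache` are hypothetical names, and the encoder is a deterministic stand-in rather than a real language model.

```python
import hashlib
from typing import Callable, Dict, List

Embedding = List[float]


def toy_text_encoder(name: str, dim: int = 8) -> Embedding:
    """Hypothetical stand-in for an expensive text encoder (not a real model)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


class EmbeddingCache:
    """Memoize category-name embeddings so each name is encoded at most once."""

    def __init__(self, encoder: Callable[[str], Embedding]) -> None:
        self._encoder = encoder
        self._store: Dict[str, Embedding] = {}
        self.encoder_calls = 0  # counts cache misses, for illustration only

    def get(self, name: str) -> Embedding:
        key = name.strip().lower()  # normalize so "Dog" and "dog " share an entry
        if key not in self._store:
            self._store[key] = self._encoder(key)
            self.encoder_calls += 1
        return self._store[key]

    def get_many(self, names: List[str]) -> List[Embedding]:
        return [self.get(name) for name in names]


cache = EmbeddingCache(toy_text_encoder)
cache.get_many(["person", "dog", "person"])  # encodes "person" and "dog" once each
cache.get_many(["dog", "car"])               # only "car" is new
print(cache.encoder_calls)  # 3
```

With a cache like this, the encoder cost is paid once per distinct category name rather than once per frame, which is the kind of saving that matters when the same vocabulary is queried across a whole video.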
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
