OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning
Anwesa Choudhuri, Girish Chowdhary, Alexander G. Schwing

TL;DR
This paper introduces OW-VISCapTor, a novel framework for open-world video instance segmentation and captioning that detects, segments, tracks, and describes unseen objects in videos using a vision-language connection via abstractors.
Contribution
It develops a new approach connecting visual features and language models through object and object-to-text abstractors for open-world video understanding.
Findings
Surpasses baseline by 13% on unseen objects
Achieves 10% improvement on object-centric captioning
Introduces a contrastive loss for diverse object queries
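This summary does not give the form of the inter-query contrastive loss, so the following is only a minimal sketch of one way to discourage object queries from collapsing onto each other. It is not the authors' formulation. The function name `inter_query_diversity_loss`, the `temperature` parameter, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def inter_query_diversity_loss(queries, temperature=0.1):
    """Hypothetical diversity penalty over object queries of shape (N, D).

    For each query, take a log-sum-exp over its temperature-scaled cosine
    similarity to every *other* query, then average. The value is large when
    queries are near-duplicates and small when they point in different directions.
    """
    # L2-normalise so the dot product is a cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sim = (q @ q.T) / temperature
    n = sim.shape[0]
    # Drop the diagonal (self-similarity), leaving n - 1 entries per row.
    off_diag = sim[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    # Numerically stable log-sum-exp per query.
    row_max = off_diag.max(axis=1, keepdims=True)
    lse = row_max[:, 0] + np.log(np.exp(off_diag - row_max).sum(axis=1))
    return float(lse.mean())

# Identical queries are penalised more than mutually orthogonal ones.
collapsed = np.ones((4, 8))
spread = np.eye(4, 8)
print(inter_query_diversity_loss(collapsed) > inter_query_diversity_loss(spread))
```

Minimising a term of this kind pushes queries apart in embedding space, which is the effect the finding above attributes to the contrastive loss.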
Abstract
We propose the new task "open-world video instance segmentation and captioning". It requires detecting, segmenting, tracking, and describing with rich captions never-before-seen objects. This challenging task can be addressed by developing "abstractors" which connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) by developing an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially diverse open-world object queries to discover never-before-seen objects in videos. An inter-query contrastive loss further encourages the diversity of object queries. The object-to-text abstractor is augmented with masked cross-attention and acts as a bridge between the object queries and a frozen LLM to generate rich and…
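The abstract mentions masked cross-attention without further detail. As a rough illustration of the general technique (not the paper's implementation), the sketch below restricts each query's attention to the feature locations allowed by a boolean mask. The function name, the single-head formulation without learned projections, and the fallback for empty masks are assumptions.

```python
import numpy as np

def masked_cross_attention(queries, features, mask):
    """Single-head cross-attention restricted by a boolean mask.

    queries:  (Q, D) array, one row per object query
    features: (N, D) array, one row per flattened feature location
    mask:     (Q, N) bool array, True where a query may attend
    Returns the attended output (Q, D) and the attention weights (Q, N).
    """
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)
    # Assumption: a query with an empty mask attends everywhere,
    # which avoids a softmax over an all-masked row.
    empty = ~mask.any(axis=1, keepdims=True)
    mask = mask | empty
    scores = np.where(mask, scores, -np.inf)
    # Softmax over feature locations; masked entries get weight exactly 0.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ features, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
f = rng.normal(size=(5, 4))
m = np.array([[True, True, False, False, False],
              [False, False, False, False, False]])
out, w = masked_cross_attention(q, f, m)
print(out.shape, w[0])
```

In this sketch the first query only draws on the two locations its mask permits, while the second query, whose mask is empty, falls back to ordinary attention over all locations.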
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging
