OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning

Anwesa Choudhuri; Girish Chowdhary; Alexander G. Schwing

arXiv:2404.03657 · cs.CV · April 1, 2025 · 1 cites

TL;DR

This paper introduces OW-VISCapTor, a framework for open-world video instance segmentation and captioning. It detects, segments, tracks, and describes never-before-seen objects in videos by connecting a vision model to a language model through abstractors.

Contribution

It connects a multi-scale visual feature extractor to a large language model through two abstractors, an object abstractor and an object-to-text abstractor, for open-world video understanding.

Findings

01 Surpasses baseline by 13% on unseen objects

02 Achieves 10% improvement on object-centric captioning

03 Introduces a contrastive loss for diverse object queries
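
The third finding can be illustrated with a minimal sketch. The page does not give the loss formula, so the InfoNCE-style form below is an assumption rather than the authors' definition: each query is treated as its own positive and every other query as a negative, so the loss drops as the queries spread apart. The function name, the temperature value, and the use of NumPy are hypothetical choices made for illustration.

```python
import numpy as np


def inter_query_contrastive_loss(queries, temperature=0.1):
    """Hypothetical diversity loss over object queries of shape (N, D).

    Assumption: query i should be similar to itself and dissimilar
    to every other query j != i. Not the paper's exact formulation.
    """
    # L2-normalize rows so dot products become cosine similarities.
    norms = np.linalg.norm(queries, axis=1, keepdims=True)
    q = queries / np.maximum(norms, 1e-12)

    # Pairwise similarity logits, shape (N, N).
    logits = (q @ q.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for query i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

Under this assumed form, four identical queries give a loss of log 4, while four mutually orthogonal queries give a loss close to zero, which is the behavior a diversity-encouraging term should have.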

Abstract

We propose the new task 'open-world video instance segmentation and captioning'. It requires detecting, segmenting, tracking, and describing never-before-seen objects with rich captions. This challenging task can be addressed by developing "abstractors" which connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) by developing an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially diverse open-world object queries to discover never-before-seen objects in videos. An inter-query contrastive loss further encourages the diversity of object queries. The object-to-text abstractor is augmented with masked cross-attention and acts as a bridge between the object queries and a frozen LLM to generate rich and…
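
The masked cross-attention mentioned in the abstract can be sketched in a few lines. This is a generic illustration, not the authors' implementation: it assumes each object query comes with a binary mask over spatial locations and may only attend to features inside that mask. Learned query/key/value projections and multiple heads are omitted, and the function and argument names are hypothetical.

```python
import numpy as np


def masked_cross_attention(queries, features, mask):
    """Single-head cross-attention restricted by a per-query mask.

    queries:  (N, D) object queries
    features: (P, D) visual features at P spatial locations
    mask:     (N, P) bool, True where query n may attend to location p
    Returns the attended features (N, D) and the attention weights (N, P).
    """
    d = queries.shape[1]
    scores = queries @ features.T / np.sqrt(d)  # (N, P)

    # Push disallowed locations to a very large negative score so their
    # softmax weight underflows to zero. A finite value (rather than -inf)
    # means a row with an all-False mask falls back to uniform attention.
    scores = np.where(mask, scores, -1e9)

    # Numerically stable softmax over locations.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ features, weights
```

A query whose mask covers a single location simply copies that location's feature, which makes the restriction easy to check by hand.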

Videos

OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging