# VoCap: Video Object Captioning and Segmentation from Any Prompt

**Authors:** Jasper Uijlings, Xingyi Zhou, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid

arXiv: 2508.21809 · 2025-09-01

## TL;DR

VoCap is a video model that takes a video plus a text, box, or mask prompt and returns both a segmentation masklet and a caption for the referred object. It is trained on the new SAV-Caption dataset, achieves state-of-the-art results on referring expression video object segmentation, and establishes a benchmark for video object captioning.

## Contribution

The paper introduces VoCap, a model for promptable video object segmentation and captioning, and SAV-Caption, a large-scale dataset built by annotating the existing SAV segmentation dataset with VLM-generated pseudo captions, plus manual annotations on the validation set for unbiased evaluation.
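
The input/output contract described in this summary (a prompt that is text, a box, or a mask going in; a masklet plus a caption coming out) can be sketched as plain data types. This is an illustration only: every class, field, and function name below is hypothetical and not taken from the VoCap codebase, and the `frame_index` field is an assumption about how a box or mask prompt would be tied to a frame.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# All names here are hypothetical; they only illustrate the input/output contract.
Mask = List[List[bool]]  # one binary mask, indexed [row][col]


@dataclass
class TextPrompt:
    expression: str  # a referring expression describing the object


@dataclass
class BoxPrompt:
    frame_index: int  # assumption: the frame the box is drawn on
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class MaskPrompt:
    frame_index: int  # assumption: the frame the mask is given on
    mask: Mask


Prompt = Union[TextPrompt, BoxPrompt, MaskPrompt]


@dataclass
class ObjectResult:
    masklet: List[Mask]  # one mask per video frame
    caption: str  # object-centric caption


def task_name(prompt: Prompt) -> str:
    """Name the segmentation task that a given prompt type corresponds to."""
    if isinstance(prompt, TextPrompt):
        return "referring expression segmentation"
    return "promptable video object segmentation"
```

Whatever the prompt type, the result has the same shape, which is what lets a single model cover segmentation and captioning together.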

## Key findings

- State-of-the-art in referring expression video object segmentation
- Competitive results in semi-supervised video object segmentation
- Establishes a new benchmark for video object captioning

## Abstract

Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box, or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feeding the result to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.
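
The pseudo-captioning step relies on highlighting the object of interest with its ground-truth mask before a VLM sees the video. The abstract does not say how the highlighting is done, so the sketch below assumes one simple option: darkening every pixel outside the mask. The function names and the `dim` parameter are hypothetical, and the VLM call itself is left out.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255
Frame = List[List[Pixel]]     # indexed [row][col]
Mask = List[List[bool]]       # True where the object is


def highlight_object(frame: Frame, mask: Mask, dim: float = 0.5) -> Frame:
    """Return a copy of `frame` with pixels outside `mask` darkened by `dim`.

    Darkening the background is an assumed highlighting scheme, not
    necessarily the one used to build SAV-Caption.
    """
    same_shape = len(frame) == len(mask) and all(
        len(pixel_row) == len(mask_row)
        for pixel_row, mask_row in zip(frame, mask)
    )
    if not same_shape:
        raise ValueError("frame and mask must have the same height and width")
    return [
        [
            pixel if inside else tuple(int(channel * dim) for channel in pixel)
            for pixel, inside in zip(pixel_row, mask_row)
        ]
        for pixel_row, mask_row in zip(frame, mask)
    ]


def highlight_video(frames: List[Frame], masklet: List[Mask], dim: float = 0.5) -> List[Frame]:
    """Apply `highlight_object` to every frame using its per-frame mask."""
    if len(frames) != len(masklet):
        raise ValueError("need exactly one mask per frame")
    return [highlight_object(f, m, dim) for f, m in zip(frames, masklet)]
```

The highlighted frames would then be passed to a VLM together with a captioning instruction; other highlighting schemes, such as drawing the mask contour or blurring the background, would fit the same interface.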

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21809/full.md

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21809/full.md

## References

110 references — full list in the complete paper: https://tomesphere.com/paper/2508.21809/full.md

---
Source: https://tomesphere.com/paper/2508.21809