VoCap: Video Object Captioning and Segmentation from Any Prompt
Jasper Uijlings, Xingyi Zhou, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid

TL;DR
VoCap is a versatile video model that combines object segmentation and captioning from various prompts, trained on a new dataset, achieving state-of-the-art results in referring expression segmentation and establishing benchmarks for video captioning.
Contribution
The paper introduces VoCap, a novel model capable of promptable video object segmentation and captioning, and creates SAV-Caption, a large dataset with pseudo and manual annotations for training and evaluation.
Findings
State-of-the-art in referring expression video object segmentation
Competitive results in semi-supervised video object segmentation
Establishes a new benchmark for video object captioning
| Method | YTVOS 2018 | MOSE (zero-shot) | RefVOS-DAVIS | RefVOS-YTVOS | MeViS | UVO-VLN |
|---|---|---|---|---|---|---|
| Task | SS-VOS | SS-VOS | RefVOS | RefVOS | RefVOS | RefVOS |
| Metric | J&F | J&F | J&F | J&F | | |
| SAM2 Ravi et al. (2024) | 85.0 | 66.4 | n/a | n/a | n/a | n/a |
| Point-VOS Zulfikar et al. (2024) | 73.7 | - | - | - | - | 52.8 |
| ReferFormer Wu et al. (2022) | n/a | n/a | 61.1 | 64.9 | - | 46.4 |
| SOC Luo et al. (2024) | n/a | n/a | 67.2 | 67.3 | - | - |
| DsHmp He and Ding (2024) | n/a | n/a | 64.9 | 67.1 | 46.4 | - |
| FindTrack Cho et al. (2025) | n/a | n/a | 74.2 | 70.3 | 48.2 | - |
| UniRef++ Wu et al. (2023) | 83.2 | 59.0 | 67.2 | 67.4 | - | - |
| GLEE Wu et al. (2024a) | 80.4 | 56.1 | - | 70.6 | - | - |
| VoCap (ours) | 85.0 | 66.3 | 75.1 | 70.3 | 51.9 | 62.2 |
| VoCap + FindTrack (ours) | n/a | n/a | 74.7 | 71.2 | 53.0 | 62.7 |
| Training setup | Captioning CIDEr (SAV-Caption-val, manual) | SS-VOS J&F (SAV-Caption-val, manual) | RefVOS J&F (RefVOS-YTVOS) |
|---|---|---|---|
| Full training (phase i, ii, and iii) | 47.8 | 75.5 | 70.3 |
| Pre-training and Multitask training (phase i and ii) | 44.8 | 75.1 | 68.7 |
| 50% of SAV-Caption-train | 42.1 | 74.9 | 69.2 |
| 25% of SAV-Caption-train | 42.1 | 72.0 | 68.2 |
| 0% of SAV-Caption-train | 27.4 | 57.7 | 66.6 |
| aspect | correct | incorrect (hallucinations) | # evaluated | # missing aspects |
|---|---|---|---|---|
| object category | 88.0% | 12.0% | 50 | - |
| object properties | 87.6% | 12.4% | 105 | 7 |
| object motion / action | 85.5% | 14.5% | 62 | 5 |
| config | value |
|---|---|
| data | SAV-Caption, RefVOS-YTVOS, RefCOCO, VisualGenome |
| data-ratio | 2: 1: 1: 0.5 |
| steps | 240k |
| backbone | Eva02 |
| resolution | 512 |
| optimizer | AdamW |
| optimizer momentum | |
| gradient clipping | type: , max: 0.1 |
| weight decay | 0.05 |
| learning rate (lr) | 5e-5 |
| lr schedule | cosine |
| warmup | linear, 1k iters |
| layer-wise decay | 0.8 |
| augmentation | crop and square resize to 512 |
| batch size | 32 |
| drop path | 0.4 |
| mask losses (weight) | focal (20), dice (1) |
| IoU loss (weight) | (1) |
| occlusion loss (weight) | cross-entropy (1) |
| caption loss (weight) | cross-entropy (1) |
| caption loss label smooth | 0.1 |
| num frames | 8 |
| max. masks per frame. | image: 32, video: 2 |
Peer Reviews
Decision·Submitted to ICLR 2026
The paper explores an interesting direction by attempting to unify video object segmentation and captioning within a single framework. The idea of leveraging different input modalities (text, box, mask) is conceptually appealing and potentially useful for future multimodal understanding tasks. The paper is clearly written and easy to follow, with well-organized structure and visual illustrations.
The proposed VoCap framework mainly stacks existing techniques (SAM2 for segmentation and BLIP2-style text decoding) with minimal methodological innovation. The model design lacks substantial novelty or clear insight into how segmentation and captioning are effectively integrated beyond simple module combination. The experimental validation is insufficient and somewhat superficial; it primarily reports improvements on internal benchmarks without solid comparisons to recent or stronger baseline
The paper builds an automated pipeline to label the data. It reduces the cost of human labeling. The authors evaluate the performance of both the model and the datasets.
1. The paper does not mention "Sa2VA" or the "Ref-SAV" dataset. Consequently, it lacks a direct comparison between its pseudo-labeling pipeline (which creates SAV-Caption) and the one used by Sa2VA. Given that the data labeling pipelines appear similar, the novelty of VoCap's contribution seems limited. 2. The paper omits "InstructSeg" and "Sa2VA" from its comparison tables. In Section 5.2 and Table 4, the authors compare VoCap's performance on Referring Video Object Segmentation (RefVOS) again
Advantages: 1. Efficient data utilization: By using pseudo-labels, the model significantly expands the training data and reduces manual labor costs. 2. The paper is clear and easy to follow. 3. Subtitle training improves the understanding of referring expressions, showcasing the advantages of language-vision collaboration. 4. Qualitative examples demonstrate the effectiveness of the method.
Disadvantages: 1. What is the key difference in terms of spatio-temporal reasoning (mask-level) between the proposed method and [1]? [1] "VISA: Reasoning Video Object Segmentation via Large Language Models" 2. The argument that "there is yet no existing computer vision system that is capable of both spatio-temporal localization via segmentation masks, as well as a semantic understanding of objects via natural language" might be an over-claim. 3. The masklet (three separate temporal masks) could
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Visual Attention and Saliency Detection
Jasper Uijlings, Xingyi Zhou∗, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid
Google DeepMind
∗ Equal contribution. Correspondence: [email protected]. Work done while at Google DeepMind.
Abstract
Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feeding this to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset is available at https://github.com/google-deepmind/vocap.
1 Introduction
Understanding objects in videos, including both their fine-grained locations (represented as segmentation masks) as well as their detailed semantic properties, is a fundamental task in video understanding. It serves as a basic block for various applications, including video generation and editing Chai et al. (2023); Hu et al. (2024); Wang et al. (2024b), wild animal care Beery et al. (2020); Sun et al. (2024), and self-driving Caesar et al. (2020); Sun et al. (2020). While it is trivial for a human to point to an object in a video and describe it in detail, there is yet no existing computer vision system that is capable of both spatio-temporal localization via segmentation masks, as well as a semantic understanding of objects via natural language.
In this paper, we propose a model and data for fine-grained video object understanding with flexible input and output modalities. Our model consumes a video and an input prompt, where the prompt can be a mask or box, but also natural language (i.e. a referring expression). Our model then produces both a spatio-temporal mask (i.e., a ‘masklet’) and a free-form natural language caption describing the object. Because the output caption is a free-form sentence, it can describe the attributes of the object as well as how they change over time. Our model can be used for a variety of tasks bridging localization and language, for example referring object segmentation Yu et al. (2016); Seo et al. (2020) or location-conditioned captioning Krishna et al. (2017b), which we extend to video.
Several previous works attempt to bridge this gap between visual localization and language understanding, for example segmentation via free-form referring expressions Khoreva et al. (2018); Seo et al. (2020); Wu et al. (2022), where the goal is to produce a segmentation mask for an object given a short description which refers to a single object, or dense video object captioning (DenseVOC) Zhou et al. (2023), which produces bounding boxes and captions for all classes within a certain vocabulary in a video. While localization with referring expressions Khoreva et al. (2018); Seo et al. (2020); Wu et al. (2022) typically only takes in the minimal text required to identify an object, our model can also produce detailed captions given a location prompt. Unlike DenseVOC Zhou et al. (2023), which is non-promptable, is trained on a fixed set of objects, and is limited to producing boxes only, our model works with flexible input prompts and produces dense masklets as output in addition to captions. Our model is inspired both by existing captioning vision-language models (VLMs) such as BLIP2 Li et al. (2023) and by promptable segmentation models such as SAM2 Ravi et al. (2024), and brings a number of related video segmentation and captioning tasks together while enabling cross-task synergies. Specifically, on top of the general SAM2 design Ravi et al. (2024) we introduce a lightweight shared text encoder and decoder based on BERT Devlin et al. (2019), and an efficient caption feature extractor similar to QFormer Li et al. (2023). These modules are pre-trained on large-scale vision-language datasets Chen et al. (2022), and are used to encode the text prompt and decode the output caption. This results in a unified promptable model for Video Object Captioning and Segmentation from Any Prompt (VoCap) that takes as input a prompt (text, mask, or box) and jointly outputs a segmentation mask and a caption (see Fig. 1).
Obtaining data to train our VoCap model is a significant challenge: annotating video with segmentation masklets and captions is tedious and expensive, and not easily scalable to large volumes of data. Hence we propose a pseudo-labeling pipeline starting with the SAV Manual dataset Ravi et al. (2024), which contains accurate segmentation masks. We then automatically generate object-centric captions using a large-scale VLM (Gemini 1.5 Pro Vision Gemini Team (2024)). By pre-processing the videos to highlight each object mask and blur the background, we steer the VLM to describe each object and what happens to it with satisfactory accuracy and detail. This enables us to generate a large-scale training set with masks and object-centric captions without additional human labor. We then combine this dataset with existing partially annotated datasets Krishna et al. (2017b); Seo et al. (2020); Yu et al. (2016); Chen et al. (2022) to co-train our model.
For evaluation, we ran a human annotation campaign on the SAV-val dataset Ravi et al. (2024) where each object is captioned by three different annotators. Furthermore, we evaluate our model on existing datasets and tasks such as mask-prompted video segmentation on MOSE Ding et al. (2023b) and referring segmentation on RefVOS Seo et al. (2020). To summarize, we make the following contributions:
- We present a unified promptable model VoCap that can produce both spatio-temporal masklets and captions for objects in video. Our model is flexible in both the input and the output, taking as input a prompt (text, mask, or box) and outputting masklets and captions.
- We collect manually annotated object captions on SAV-val and create pseudo-captions by leveraging existing mask annotations on SAV-train using Gemini 1.5 Pro. We make both the manual and pseudo-annotations publicly available (https://github.com/google-deepmind/vocap/). By showing good performance on the manually annotated object captions, we demonstrate that these pseudo-labels are effective for training our captioning model.
- We set a new state of the art for Referring Expression Video Object Segmentation, show competitive results on semi-supervised object segmentation, and establish a benchmark for video object captioning.
2 Related Work
Segmentation and Captioning Models. A variety of models are dedicated to video segmentation and expect an initial input mask Yang et al. (2021, 2024); Yang and Yang (2022); Cheng and Schwing (2022); Cheng et al. (2024); Guo et al. (2024); Deng et al. (2024); Ravi et al. (2024); Yang et al. (2023b), a referring expression Seo et al. (2020); Lan et al. (2023); Wu et al. (2022), or both Cheng et al. (2023); Wu et al. (2023). CLIPSeg Lüddecke and Ecker (2022) can also consume a query image. SAM2 Ravi et al. (2024) can do segmentation without any inputs, as is done in DEVA Cheng et al. (2023), by starting from the original SAM Kirillov et al. (2023) with point-grid prompts. However, none of these models can generate descriptions. In captioning, there are works on global video captioning which describe the whole video Kanani et al. (2021); Iashin and Rahtu (2020); Yang et al. (2023a); Yao et al. (2015); Wang et al. (2021a), or on dense image captioning which provide object-centric captions and their locations in images Johnson et al. (2016); Li et al. (2019); Shao et al. (2022); Zhang et al. (2023); Yuan et al. (2024); Peng et al. (2023); Xu et al. (2024). Only a few works do dense video captioning Choudhuri et al. (2024); Zhou et al. (2023) for objects. DenseVOC Zhou et al. (2023) predicts bounding boxes with captions but does not predict masks and only detects a predefined set of object classes. The OW-VISCapTor model Choudhuri et al. (2024) predicts segments with captions, but cannot handle textual or mask input prompts. Furthermore, OW-VISCapTor is based on an image-first tracking-by-detection paradigm that can be suboptimal in long videos with occlusions, while we build on top of strong memory-based trackers Ravi et al. (2024); Cheng and Schwing (2022) and can handle long and challenging videos Ding et al. (2023b).
Datasets. While numerous datasets exist for video object segmentation and captioning separately, very few combine both on the same set. Video segmentation datasets include various input forms, including a mask given on the first frame (semi-supervised video object segmentation, or SS-VOS) Perazzi et al. (2016); Caelles et al. (2019); Ding et al. (2023b); Qi et al. (2022); Wang et al. (2021b), a target object class (semantic object segmentation) Kim et al. (2020); Real et al. (2017); Russakovsky et al. (2015) or a referring expression Ding et al. (2023a); Khoreva et al. (2018); Seo et al. (2020); Wu et al. (2022) (Referring Video Object Segmentation, or RefVOS). While referring expression datasets have masks and referring text, this text tends to be mainly focused on identifying the object in the video, not describing it. To link captions better to the visual domain, several datasets focus on having grounded captions in both the image domain Krishna et al. (2017b); Lin et al. (2024); Plummer et al. (2015); Peng et al. (2023); Pont-Tuset et al. (2020); Wang et al. (2023b, 2024a); Xue et al. (2024) and video domain Voigtlaender et al. (2023); Zhang et al. (2020); Zhou et al. (2018a, 2019). In particular, in the video domain, bounding box annotations are added in Zhou et al. (2018a) to YouCook2 Zhou et al. (2018b), and in Zhou et al. (2019) to ActivityNet Krishna et al. (2017a). In Zhang et al. (2020) the relations of VidOR Shang et al. (2019) are converted into captions while grounding is provided by the existing bounding boxes. BenSMOT Li et al. (2024) provides a human-focused dataset with boxes, their object-centric captions, and interactions. Video Localized Narratives Voigtlaender et al. (2023) introduced an object-centric protocol in which captions are grounded by a mouse trace. In contrast, in this paper we provide a stronger form of grounding by linking captions to segmentation masks. 
Our pseudo-labeled dataset is also an order of magnitude larger than these datasets (see Table 1).
Pseudo labels. As model capabilities and the data requirements for training large models grow, it is increasingly common to use automatically generated labels in the pre-training stage. SAM2 Ravi et al. (2024) provides automatically generated masks on their SAV dataset, enabling distillation. BLIP3 Xue et al. (2024), OWLv2 Minderer et al. (2023), and Kosmos-2 Peng et al. (2023) go beyond distillation for bounding box generation by exploiting existing captions: they extract noun phrases from the captions, feed them to an open-world detector, and only keep high-scored boxes. MVDP Lin et al. (2024) draws existing object classes and location annotations in an image using set-of-masks Yang et al. (2023c) and feeds this to GPT-4V OpenAI (2023) to generate object-centric captions, relationships, and Q&A pairs. In this paper, we augment videos with ground-truth segmentation masks and prompt vision-language models to create high-quality pseudo captions.
3 The SAV-Caption Dataset
We want to have a large-scale training set with spatio-temporal segmentation masks and their captions. Therefore we start from SAV Ravi et al. (2024), the largest and most diverse video dataset with segmentation masks. We use the ‘Manual’ part which was annotated by combining SAM2 predictions Ravi et al. (2024) with human annotator corrections to ensure high-quality masklets. Next we detail how we add captions to the existing SAV segmentation dataset using automatic annotations (Sec. 3.1) and human annotations (Sec. 3.2). In both cases we want to have captions with the object class, its visual properties (which aligns the captions with visual referring expressions), and what it does (which captures the temporal semantics). Such captions are aligned with previous dense video captioning datasets (e.g. Voigtlaender et al. (2023); Zhang et al. (2020)).
3.1 Automatically Annotated Training Data
We use Gemini 1.5 Pro Vision Gemini Team (2024) to automatically generate captions on this dataset. This model is a long-context vision language model and is therefore suited to consume relatively large video clips. To create accurate captions, we draw inspiration from works which augment images with visual prompts to focus the attention of the visual models on what matters, thereby simplifying the task Nasiriany et al. (2024); Yang et al. (2023c); Zheng et al. (2024); Wu et al. (2024c); Shtedritski et al. (2023). In particular, we adopt two visual prompting techniques: 1) we highlight the target segment by drawing a clear red contour around it (Contour); 2) upon finding that Gemini would still sometimes focus on objects in the background, we blur the background using a Gaussian filter (Blur). Both modifications are explicitly mentioned in the textual prompt. An example of a video frame fed to Gemini is shown in the visual-prompt figure.
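The two visual prompting steps (Contour and Blur) can be illustrated with a minimal, pure-Python sketch. This is not the preprocessing code used for SAV-Caption: the names `box_blur`, `contour`, and `visual_prompt` are hypothetical, a mean (box) filter stands in for the Gaussian filter, and the contour is painted on the object's own boundary pixels rather than around it.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]   # RGB in [0, 1]
Image = List[List[Pixel]]
Mask = List[List[int]]               # 1 inside the object, 0 elsewhere

RED: Pixel = (1.0, 0.0, 0.0)


def box_blur(img: Image, radius: int = 1) -> Image:
    """Mean filter per channel; a simple stand-in for a Gaussian blur."""
    h, w = len(img), len(img[0])
    out: Image = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(tuple(sum(p[c] for p in window) / len(window) for c in range(3)))
        out.append(row)
    return out


def contour(mask: Mask) -> Mask:
    """Mark object pixels that touch the background or the image border."""
    h, w = len(mask), len(mask[0])
    edge = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    edge[y][x] = 1
                    break
    return edge


def visual_prompt(img: Image, mask: Mask, radius: int = 1) -> Image:
    """Blur the background, keep the object sharp, and paint its contour red."""
    blurred, edge = box_blur(img, radius), contour(mask)
    return [[RED if edge[y][x] else (img[y][x] if mask[y][x] else blurred[y][x])
             for x in range(len(img[0]))]
            for y in range(len(img))]
```

Applied per frame with the ground-truth mask of that frame, this yields a clip in which only the target object stays sharp and outlined, which is the effect the textual prompt then refers to.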
For the textual part of the prompt, we carefully iterated to increase the quality of the generated caption. In this process, we found it helpful to structure the prompt: we ask the model to first describe the object, then its visual properties, and then what it does, and finally to give the caption while keeping the earlier mentioned elements consistent. Statistics of SAV-Caption train are given in Tab. 1, and example captions are shown in the figure visualizing the Gemini-generated training captions. The exact prompt and a quantitative analysis are given in the supplementary.
3.2 Human Annotated Validation Data
Our evaluation should be free of any potential biases of any Visual Language Model. Therefore we collect our evaluation set fully manually with three captions per object. In particular, we start from SAV-val and instruct the raters to provide a single free-form caption of the object highlighted in the video. Like in Sec. 3.1 we highlight the object with a red border but we do not blur the background. We have explicit instructions for the annotators to include in their caption the object class, its visual properties, and what it does. We also ask them to not mention irrelevant objects in the background. The statistics of SAV-Caption val are given in Tab. 1. The annotation instructions and the UI can be found in the supplementary material.
3.3 Comparison with Other Datasets
Only a few video datasets exist in which objects are annotated with both spatio-temporal masklets and captions Ding et al. (2023a); Khoreva et al. (2018); Seo et al. (2020). These existing datasets were all made for referring expression segmentation but can be repurposed for the captioning task. However, referring expressions were written to make objects uniquely identifiable, not to support semantic understanding. Furthermore, our training set is at least one order of magnitude bigger.
4 VoCap Model
Given an image or video (an image is treated as a single-frame video) and a prompt, where the prompt can be a bounding box or a mask in the first frame, or a textual description, our VoCap model produces a binary masklet and a caption string for the corresponding object.
4.1 Model Architecture
As illustrated in the framework figure, our model is composed of segmentation modules inspired by Ravi et al. (2024), including an image encoder, a memory encoder, a memory attention module, a location prompt encoder, and a mask decoder. We add new language modules: a text encoder, a text feature extractor, and a text decoder. As a result, our model can handle both text and masks, as inputs as well as outputs.
The image encoder takes a single frame as input and produces down-sampled image features. This can be any visual backbone, and we use EVA02 Fang et al. (2023) given its dedicated pretraining for both language Radford et al. (2021) and localization tasks He et al. (2022). Following ViTDet Li et al. (2022), we use simple convolutional upsampling layers Ronneberger et al. (2015); Zheng et al. (2021) to produce multi-scale features as additional inputs for the mask decoder Ravi et al. (2024). Note that each frame is processed separately without temporal communication.
The memory encoder and memory attention together augment the per-frame image feature with temporal information. Specifically, at each timestep, the memory encoder fuses the input image and output mask into a memory feature, which is stored in a memory bank that keeps a history of spatio-temporal appearance features (the memory dimension can differ from the feature dimension). Following SAM2 Ravi et al. (2024) we use a fixed-size memory bank with a first-in-first-out memory queue. Several cross-attention layers between the current image feature and the memory bank make the output image features temporally aware.
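The fixed-size first-in-first-out behaviour of the memory bank can be sketched with a bounded deque. The class name `MemoryBank`, its methods, and the capacity used in the usage note are illustrative assumptions; the features are placeholders for the memory encoder's output.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class MemoryBank:
    """Keeps the most recent `max_frames` memory features (oldest dropped first)."""

    def __init__(self, max_frames: int) -> None:
        # deque(maxlen=...) silently evicts the oldest entry when full.
        self._queue: Deque[Tuple[int, Any]] = deque(maxlen=max_frames)

    def add(self, frame_idx: int, feature: Any) -> None:
        """Store the memory feature computed for one frame."""
        self._queue.append((frame_idx, feature))

    def frames(self) -> List[int]:
        """Frame indices currently available to memory attention."""
        return [idx for idx, _ in self._queue]

    def features(self) -> List[Any]:
        """Stored features, oldest first."""
        return [feat for _, feat in self._queue]

    def __len__(self) -> int:
        return len(self._queue)
```

For example, with a hypothetical capacity of 3, adding frames 0 to 4 leaves frames 2, 3, and 4 in the bank; an empty bank corresponds to the first frame or a single image, where memory attention has nothing to attend to.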
The location prompt encoder projects location inputs to embeddings. Specifically, box prompts are encoded as sparse embeddings with one feature vector per point (two points for the two box corners). Mask prompts are encoded as dense embeddings with the same shape as the image feature.
The text encoder takes text strings as inputs and projects them to embeddings. It can be any language model Devlin et al. (2019); Team et al. (2024a, b); Raffel et al. (2020); Touvron et al. (2023) that encodes integer vocabulary indexes to embeddings. Specifically, we feed text prompts as the text prefix to the language model with full attention, and extract the features before the vocabulary classification layer. We use an additional dimension-matching layer to project from the language model embedding space to the prompt embedding space. We treat text prompts as sparse embeddings as well, with one embedding per token of the text query. Because the text prompt provides conditioning for the entire video and because the target object does not always appear in the early frames of the video, we feed the text prompt embedding to all frames of the video.
The mask decoder takes the temporally aware image feature and the prompt features (sparse or dense) as inputs, and outputs the mask for the current frame. In SAM, the mask decoder uses cross-attention to exchange information between the image and prompt features:
[TABLE]
Here a learned mask token is concatenated with the sparse prompt embeddings and passed through the cross-attention operation together with the image features. A mask decoding function with upsampling convolutions and a final dot-product Cheng et al. (2021) then produces the mask. The output of the cross-attention can be considered as the object feature conditioned on the prompt. Besides the mask, the mask decoder also predicts for each frame a binary object appearance indicator to handle occlusion or out-of-view movement, and an IoU prediction which estimates the quality of the mask.
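To make the data flow concrete, the following is a heavily simplified sketch of the idea: a mask token concatenated with sparse prompt tokens attends to the image features, and the attended mask token is compared with each location by a dot-product. It is not the SAM or VoCap decoder; single-head attention without learned projections replaces the real transformer layers, there is no upsampling, and the names `cross_attention` and `decode_mask` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention of each query over all keys/values."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def decode_mask(mask_token: Vector, prompt_tokens: List[Vector],
                image_features: List[Vector]) -> Tuple[Vector, List[int]]:
    """Return the prompt-conditioned object feature and a per-location binary mask."""
    queries = [mask_token] + prompt_tokens   # concatenate mask token and sparse prompt
    attended = cross_attention(queries, image_features, image_features)
    object_feature = attended[0]             # output slot of the mask token
    logits = [dot(object_feature, f) for f in image_features]
    return object_feature, [int(logit > 0.0) for logit in logits]
```

Here `image_features` is a flattened list of per-location feature vectors; in this toy version the prompt tokens pass through attention but do not influence the mask token, whereas in a real decoder they would interact through self-attention layers.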
Text feature extractor. Similar to how the object features are extracted in the mask decoder via cross-attention, we use learned caption tokens and cross-attention to extract caption features for each object:
[TABLE]
where we only use the caption-token output and discard the other outputs. This formulation is analogous to popular vision-feature extractors in vision-language models Jaegle et al. (2021); Alayrac et al. (2022); Ryoo et al. (2021); Li et al. (2023), while we additionally condition on the prompt embeddings (sparse or dense). Following BLIP2 Li et al. (2023), we adopt its number of caption tokens.
Text decoder. Following popular vision-language model design Li et al. (2023); Wang et al. (2022); Liu et al. (2023) we feed the object-aware caption feature as a prefix to an auto-regressive language model to produce the object caption:
[TABLE]
Again, the text decoder can be any language model Devlin et al. (2019); Team et al. (2024a, b); Raffel et al. (2020); Touvron et al. (2023) with a causal attention mask. We note that both the architecture and the weights of the text encoder and text decoder can be shared even though the text decoder uses causal attention, and the text encoder uses bidirectional attention. Therefore, during training, the language model is updated for both text encoding and decoding regardless of whether we use text as an input prompt or as a target output caption. We follow the standard transformer decoder Vaswani et al. (2017); Devlin et al. (2019) as it is simple and effective Wang et al. (2022); Wu et al. (2024b); Zhou et al. (2023).
4.2 Training
Given the flexibility of our inputs and outputs, our model can leverage a variety of annotation types from different datasets. For SAV-Caption our model consumes a mask prompt and calculates the loss on both the predicted masklet and the predicted caption. On VisualGenome Krishna et al. (2017b) our model consumes a box prompt and calculates the loss on the caption. For SS-VOS we have a first-frame mask input prompt and a loss on the masklet. For RefVOS we have a text input prompt and a loss on the masklet. Following other joint models for image and video, we treat images as a single-frame video Ravi et al. (2024); Villegas et al. (2022); Bain et al. (2021). Concretely, we do not use the memory module for the first frame or for images. To leverage all available data, we first pre-train our language and vision components separately, then perform multi-task training with joint mask and caption annotations. Finally, to achieve the best performance, we finetune on specific datasets. See Sec. 5.1 for details.
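The mapping from annotation type to prompt and supervised outputs can be written down as a small lookup table. The sketch below only restates the four cases described in this section; `TaskSpec`, `TASKS`, and `total_loss` are hypothetical names, and the unweighted sum is an assumption made for illustration rather than the actual loss weighting.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TaskSpec:
    prompt: str                   # "mask", "box", or "text"
    supervised: Tuple[str, ...]   # outputs that receive a loss


TASKS: Dict[str, TaskSpec] = {
    "SAV-Caption": TaskSpec("mask", ("masklet", "caption")),
    "VisualGenome": TaskSpec("box", ("caption",)),
    "SS-VOS": TaskSpec("mask", ("masklet",)),
    "RefVOS": TaskSpec("text", ("masklet",)),
}


def total_loss(task: str, losses: Dict[str, float]) -> float:
    """Sum only the loss terms that the given task has annotations for."""
    return sum(losses[name] for name in TASKS[task].supervised)
```

With per-output losses `{"masklet": 5.0, "caption": 2.0}`, a VisualGenome example contributes only the caption term, while a SAV-Caption example contributes both.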
4.3 Inference
Our model runs on images or videos of arbitrary length. As in training, for an image or the first frame of a video, the visual features are not modified by the memory attention since there are no memories. For the following frames of the video, our model runs in an online manner: in each frame the model produces both mask and caption outputs, and updates the memory. Using the object appearance predictions, we keep the captions from frames in which the object is present, and take the most common caption prediction as the final caption for the trajectory.
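The final caption selection can be sketched as a filtered majority vote. The function name `aggregate_caption` is hypothetical, and returning `None` when the object is never predicted as present is an assumption, since that case is not described here.

```python
from collections import Counter
from typing import List, Optional


def aggregate_caption(captions: List[str], visible: List[bool]) -> Optional[str]:
    """Pick the most common per-frame caption among frames where the object appears."""
    kept = [c for c, v in zip(captions, visible) if v]
    if not kept:
        return None  # assumption: no caption if the object is never visible
    # most_common orders ties by first occurrence (Python 3.7+).
    return Counter(kept).most_common(1)[0][0]
```

For instance, if a caption dominates only in frames where the object is predicted as occluded, it is ignored and the most frequent caption among the visible frames is returned.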
5 Experiments
5.1 Implementation Details
We implement our model in JAX Bradbury et al. (2018). Our image encoder is EVA02-L Fang et al. (2023), a 24-layer vision transformer (ViT) Dosovitskiy (2021) with MAE He et al. (2022) and CLIP Radford et al. (2021) pretraining. We chose this encoder as it is more suitable for language tasks, compared to the MAE-pretrained ViT used in SAM Kirillov et al. (2023); Ravi et al. (2024). Our shared language encoder and decoder is a 6-layer BERT model Devlin et al. (2019) with random initialization, which has been shown to be effective and efficient for object captioning in several works Wu et al. (2024b); Zhou et al. (2023); Wang et al. (2022). The text feature extractor contains 2 cross-attention layers with the same architecture as the mask decoder. Other modules follow the SAM2 Ravi et al. (2024) architecture and are all randomly initialized. In appendix B we show that our re-implementation of SAM2 is comparable to the original, and that EVA02-L is a strong alternative backbone.
We have three training phases: (i) a pre-training phase to initialize weights of the visual and language modalities separately. (ii) A multi-task training phase in which we train our full model end-to-end jointly on the three tasks, namely Captioning, SS-VOS, and RefVOS. (iii) A finetuning phase in which we optimize the model per dataset. In more detail, for the pre-training phase (i) we use the text encoder and decoder of an existing checkpoint trained for image captioning on WebLI Chen et al. (2022). Since we use an eva02 image backbone, we cannot use the existing SAM2 checkpoints. Instead, we pre-train our visual components on SAV Ravi et al. (2024), YTVOS Xu et al. (2018), and DAVIS Perazzi et al. (2016), following the SAM2 data mixture ratio (49.5: 9.2: 1.3). We train 300k iterations, using a batch size 64 at resolution. We verified that this training recipe produces results close to the official SAM2 model which is trained on proprietary datasets and uses a larger resolution (see supplementary for more details).
In our multi-task training phase (ii), we use datasets with both language and segmentation annotations (Tab. 2): VisualGenome Krishna et al. (2017b), RefCOCO Yu et al. (2016); Mao et al. (2016, 2016), RefVOS-YTVOS Seo et al. (2020), and our SAV-Caption-train, with a data mixture ratio 0.5: 1: 1: 2. We train for 240k iterations with batch size 32 using a input resolution. The multi-task training phase takes hours on 32 H100 GPUs. Since most prior work reports numbers specialized per dataset, we have a small finetuning stage per dataset.
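As an illustration of what a 0.5: 1: 1: 2 data mixture means in practice, the following minimal sketch normalizes the ratio into sampling probabilities and draws dataset names accordingly. Whether the ratio is applied per example or per batch is not stated in the text; per-draw sampling is an assumption here, and the function names are hypothetical.

```python
import random
from collections import Counter

# Mixture weights as stated in the text, in the order
# VisualGenome : RefCOCO : RefVOS-YTVOS : SAV-Caption-train.
MIXTURE = {
    "VisualGenome": 0.5,
    "RefCOCO": 1.0,
    "RefVOS-YTVOS": 1.0,
    "SAV-Caption-train": 2.0,
}


def mixture_probabilities(weights):
    """Normalize mixture weights into sampling probabilities."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, num_draws, seed=0):
    """Draw dataset names i.i.d. in proportion to the mixture weights.

    Assumption: one draw selects the source of one training example.
    """
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_draws)


probs = mixture_probabilities(MIXTURE)
counts = Counter(sample_sources(MIXTURE, num_draws=10_000))
```

Under this reading, SAV-Caption-train supplies 2 / 4.5 of the draws, i.e. a little under half of the multi-task training examples.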
5.2 Captioning
The localized captioning task is defined as producing a text caption given a location prompt (e.g. box or mask). For images we perform image captioning on Visual Genome given a box around an object as input prompt. For videos we address localized object captioning on SAV-Caption-val when prompted by a mask annotation for the first video frame. For both the image and video captioning tasks we use the standard CIDEr Vedantam et al. (2015) metric.
Video Object Captioning Baselines. Since our VoCap model is the first model that can do simultaneous object segmentation and captioning given a first-frame input mask, there are no existing methods to compare against. However, we present results for a few strong baselines. First, we run a semi-supervised VOS method to obtain segments, and feed these into existing off-the-shelf captioning models. In particular, we run our re-implemented and retrained SAM2 model Ravi et al. (2024) as the SS-VOS method and apply the popular captioning models BLIP2 Li et al. (2023) (which predicts captions from single images without any additional prompt) and PixelLLM Xu et al. (2024) (which predicts captions from bounding-box location prompts in single images). For BLIP2 Li et al. (2023), we follow CaptionAnything Wang et al. (2023a) and use the SAM2 mask to crop the object and mask out the background. For PixelLLM Xu et al. (2024), we extract the bounding box from the SAM2 mask and use it as the prompt. These image baselines produce a caption for each frame, and we obtain a single video-level caption by taking the most common caption in the per-frame caption sequence.
In addition, we create two baselines which closely follow our annotation pipeline: We use SAM2 Ravi et al. (2024) and UniRef++ Wu et al. (2023) to generate segmentation masks based on a first frame input mask. Then we feed these generated segments to our Gemini Gemini Team (2024) pseudo-annotation pipeline (Sec. 3.1).
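The two glue steps of the image-based baselines, extracting a box prompt from a mask and reducing per-frame captions to one video-level caption, could be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, masks are assumed to be nested lists of 0/1 values, and the whitespace and case normalization before voting is an added assumption.

```python
from collections import Counter


def bbox_from_mask(mask):
    """Tight box (x_min, y_min, x_max, y_max) around nonzero pixels, or None."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None  # empty mask: the object is not visible in this frame
    return (min(xs), min(ys), max(xs), max(ys))


def video_caption(frame_captions):
    """Return the most frequent per-frame caption.

    Assumption: captions are compared after stripping whitespace and
    lowercasing. Ties go to the caption that appeared first.
    """
    normalized = [c.strip().lower() for c in frame_captions if c.strip()]
    if not normalized:
        return ""
    return Counter(normalized).most_common(1)[0][0]
```

A majority vote of this kind ignores temporal order, which fits the observation in the results that image-based baselines do not capture motion.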
Video Object Captioning Results. We finetune VoCap jointly on SAV-Caption-train and VisualGenome Krishna et al. (2017b). Tab. 4 presents results on SAV-Caption-val. VoCap significantly outperforms all baselines in captioning at only a minor decrease in segmentation performance compared to SAM2. In particular, BLIP2 Li et al. (2023) and PixelLLM Xu et al. (2024) yield suboptimal performance, likely because these image-based models do not capture motion. More importantly, our results (47.8 CIDEr) surpass applying SAM2 plus Gemini pseudo-labeling (40.5 CIDEr) while being significantly more efficient (Gemini is much larger than VoCap). To understand how our model could outperform this strong baseline, we visually inspected the results. We observed that Gemini typically makes mistakes on small objects (presumably due to resolution) and that it has an ‘actor bias’: it sometimes describes a human (hand) or animal which is near the highlighted object. In contrast, since our model actively tracks an object, it always describes the object which it is tracking. Some qualitative examples can be found in Appendix C. From a more general learning perspective, by training on large amounts of data our model can correct or smooth out some of the noise of the pseudo-labels, which is a commonly observed phenomenon (e.g. Jia et al. (2021); Lee (2013); Radford et al. (2021)).
Image Object Captioning. There are several works on localized image captioning, where the input is an image and a bounding box around an object, and the output is a caption describing the object. Since our model can also consume box prompts, and since images can be interpreted as single-frame videos, we can directly compare to these works. We evaluate the same VoCap model as before (finetuned jointly on SAV-Caption-train and VisualGenome) on the 5k validation images of VisualGenome Krishna et al. (2017b), which has human-annotated object captions. Again, we report the standard captioning metric, CIDEr Vedantam et al. (2015). Results in Tab. 4 show that our method outperforms the state-of-the-art on this task: 150 CIDEr for SCA Huang et al. (2024) vs 163 CIDEr for our VoCap model.
5.3 Video Object Segmentation
In addition to video object captioning, our model can perform both semi-supervised video object segmentation (SS-VOS) and referring expression video object segmentation (RefVOS). In SS-VOS the model input is a video and a ground-truth object mask for the first frame. In RefVOS the input is a video and a textual referring expression of the target object. For both tasks the output is a spatio-temporal masklet throughout the whole video which segments the object in every single frame.
Datasets. For SS-VOS we evaluate on the popular YTVOS 2018 dataset Xu et al. (2018) and MOSE Ding et al. (2023b). MOSE was designed to be extra difficult, featuring heavy occlusion and out-of-view motion. To compare to related works, we do not train on the MOSE training set and instead evaluate in a ‘zero-shot’ setting. Results were obtained using the official test servers on CodaLab.
For RefVOS we evaluate on the popular video referring segmentation datasets RefVOS-YTVOS Seo et al. (2020), RefVOS-DAVIS Khoreva et al. (2018), MeViS Ding et al. (2023a) and UVO-VLN Voigtlaender et al. (2023). For RefVOS-DAVIS Khoreva et al. (2018) we follow UniRef++ Wu et al. (2023) and use only its validation set of 30 videos as a zero-shot evaluation (on average 2 objects per video and 4 text queries per object). The UVO-VLN Video Narrative Grounding (VNG) benchmark provides image descriptions and segmentation masks of labeled noun phrases. To turn a description into a referring expression we simply mark the target noun with brackets (e.g. ‘the dog catches the [frisbee]’).
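The bracket marking used for UVO-VLN can be illustrated with a small helper. It is only a sketch: the function name is hypothetical, and it assumes the target noun phrase is given as a character span into the description, which the text does not specify.

```python
def mark_target_noun(description, start, end):
    """Wrap description[start:end] in square brackets to mark the target."""
    if not (0 <= start < end <= len(description)):
        raise ValueError("span must lie inside the description")
    return f"{description[:start]}[{description[start:end]}]{description[end:]}"


sentence = "the dog catches the frisbee"
begin = sentence.index("frisbee")
marked = mark_target_noun(sentence, begin, begin + len("frisbee"))
```

Marking by span rather than by string search keeps the result unambiguous when the same noun occurs more than once in a narrative.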
Now one problem with referring expressions is that the first frame may not offer the clearest view of the object: the expression could be ambiguous there (e.g. for ‘the bird flying away’ there could be three birds where one of them flies away only at the end), or the object may not even be visible in the first frame. Such cases are problematic for our model since it will be biased to keep tracking the object predicted in the first frame. To overcome this we experimented with the test-time inference method of FindTrack Cho et al. (2025): We apply VoCap to each frame independently to produce masks with IoU predictions. We start from the mask and frame with the highest IoU prediction and from there we go both forward and backward in the video to produce a full masklet. Note that with appropriate caching this only requires re-running the mask-decoders twice for each frame, which is less than 10% extra overhead. On all RefVOS datasets we report J&F scores Perazzi et al. (2016) (mean of IoU and contour accuracy) averaged over all text queries. Results on RefVOS-YTVOS and MeViS were obtained using the official test servers.
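The control flow of this test-time procedure can be sketched as below. It is a schematic reading of the description above, not the FindTrack or VoCap implementation: `propagate` is a hypothetical callback standing in for one tracking step of the model, and masks are treated as opaque values.

```python
def select_anchor_frame(iou_predictions):
    """Index of the frame whose independent prediction has the highest IoU score."""
    return max(range(len(iou_predictions)), key=lambda t: iou_predictions[t])


def propagation_order(num_frames, anchor):
    """Frames visited going forward from the anchor, and going backward from it."""
    forward = list(range(anchor + 1, num_frames))
    backward = list(range(anchor - 1, -1, -1))
    return forward, backward


def build_masklet(per_frame_masks, iou_predictions, propagate):
    """Assemble a full masklet starting from the most confident frame.

    `propagate(prev_mask, t)` is a hypothetical stand-in for tracking the
    object from the previously processed frame into frame `t`.
    """
    anchor = select_anchor_frame(iou_predictions)
    masklet = {anchor: per_frame_masks[anchor]}
    for order in propagation_order(len(per_frame_masks), anchor):
        prev = per_frame_masks[anchor]  # both passes restart at the anchor
        for t in order:
            prev = propagate(prev, t)
            masklet[t] = prev
    return [masklet[t] for t in range(len(per_frame_masks))]
```

In this sketch each frame is processed once independently and once during propagation, which matches the statement that the mask decoder is run twice per frame.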
Results. Tab. 5 compares with the state-of-the-art. Note that SAM2 can only do SS-VOS. ReferFormer Wu et al. (2022), SOC Luo et al. (2024), and DsHmp He and Ding (2024) can only do RefVOS. UniRef++ Wu et al. (2023) can handle both tasks, and GLEE Wu et al. (2024a) is a multi-task model which can also perform classical detection and segmentation. However, none of the compared methods can do captioning.
Tab. 5 shows that on the RefVOS task, if we use FindTrack at test-time, our model outperforms the state of the art on all datasets: by +4.8% on MeViS, +0.6% on RefVOS-YTVOS, +0.5% on RefVOS-DAVIS, and +16.5% on UVO-VLN. If we do not apply FindTrack Cho et al. (2025), our model is still best on most datasets, the exception being RefVOS-YTVOS, where GLEE is best. Note that GLEE does tracking by detection, which requires making predictions for all frames before it runs an algorithm to merge these per-frame predictions into masklets; it needs to analyze the whole video first. In contrast, our model without FindTrack is strictly online, which is more applicable in practice but is a harder task. FindTrack can be used in offline settings and gives significant boosts on RefVOS-YTVOS (+0.9%) and MeViS (+1.1%).
On SS-VOS our model outperforms the multi-task models GLEE and UniRef++: on YTVOS 2018 our method yields 85.5 J&F, above both UniRef++ and GLEE (see Tab. 5). On the more difficult MOSE dataset the differences are even larger: We obtain 66.3 J&F while UniRef++ obtains 59.0 and GLEE 56.1. Overall, we conclude that our model is state-of-the-art on video object segmentation, while it can additionally produce captions.
5.4 Ablation on the effectiveness of SAV-Caption Train and general data mix
To better understand its importance, we ablate the effectiveness of our automatically annotated training set (Sec. 3.1). In particular, we compare the model produced by our multi-task training phase (see Tab. 2) to several models which we train in the same manner but using increasingly less of our automatically annotated SAV-Caption-train set. However, as SAV is our only video object captioning source, we make up for the loss of such data by inverting RefVOS into a captioning dataset (following Zhou et al. (2023)): the query text prompt, which is normally an input, is treated as the output caption for the object. In addition, we also show results when doing the full training schedule. Tab. 6 shows the results.
First of all, the finetuning phase improves results, which is not very surprising. More importantly, if we use increasingly less SAV-Caption data in multi-task training, SAV-Caption performance starts dropping, and completely collapses without any SAV-Caption training data. This demonstrates that our automatic annotation is essential to obtaining good captioning performance on SAV-Caption-val. Interestingly, we observe a significant drop from 68.7% J&F to 66.6% when removing all SAV-Caption data. This demonstrates that our model is able to exploit the synergies between this task and object captioning.
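The RefVOS inversion used in this ablation amounts to swapping the roles of the text and the mask. The following sketch illustrates the idea under stated assumptions: the dictionary keys and the function name are hypothetical, and prompting with the first frame in which the object is annotated is an assumption rather than a detail given in the text.

```python
def refvos_to_captioning(example):
    """Turn a RefVOS example into a captioning example.

    `example` is a hypothetical dict with keys "video", "masks" (one mask,
    or None, per frame) and "expression" (the referring text). The
    expression, normally an input, becomes the target caption.
    """
    first_visible = next(
        (t for t, m in enumerate(example["masks"]) if m is not None), None
    )
    if first_visible is None:
        return None  # the object is never annotated, so there is no prompt
    return {
        "video": example["video"],
        "prompt_frame": first_visible,
        "prompt_mask": example["masks"][first_visible],
        "target_caption": example["expression"],
    }
```

Referring expressions are written to discriminate one object from others rather than to describe it fully, so captions obtained this way may differ in style from the SAV-Caption annotations.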
6 Conclusion
We proposed a video object segmentation and captioning model that takes either a box, mask or text prompt as input. We manually collected evaluation data for this task, and proposed an automatic annotation pipeline to curate training data. VoCap trained on our SAV-Caption dataset together with diverse existing datasets outperforms the state-of-the-art on referring expression video object segmentation and reaches top-tier performance on our captioning task and semi-supervised video segmentation. We hope our model and datasets provide a foundation for fine-grained spatio-temporal video understanding, and encourage more work in this direction.
Appendix A Details on Dataset
A.1 Quality of SAV-Caption-train
We performed a quantitative evaluation on the quality of the SAV-Caption-train set by having the authors examine captions of 50 randomly selected objects from 50 different videos. They verified separately whether each of the elements used from our structured prompt was correct or not: the object category, its properties, and what the object does (e.g. motion or action). Furthermore, we counted how many properties and actions were obvious yet not generated in the caption. Results are in Tab. 7.
The object category was correct in 88.0% of the cases. When an object was incorrect it was either subtle (e.g. sock instead of shoe) or it was a piece of clothing worn by a human and the human was captioned instead. Properties are also correct in 87.6% of the cases. Many mistakes were subtle color differences due to lighting conditions. When the human was described instead of the clothing they wore, we counted these properties as incorrect (even if they were correct for the human). There were a few properties noticeably absent, mostly because of the context of other mentioned properties. For example, one caption mentioned a white striped sweater, whereas the sweater was blue-white striped, which conveys a quite different appearance. The object’s motion/action was correct in 85.5% of the cases. Most mistakes were subtle differences between standing still and driving / walking slowly. As before, when the motion/action was of the wrong category (person instead of sweater), we counted this as incorrect. In 7 instances we found an action to be clearly missing. This was usually when the object did multiple things sequentially (e.g. a person is first standing, then walking away) and one of them was missing. In another instance there was a parrot which was correctly identified to be lying on the floor, but it did this to scratch its head on the floor, a crucial aspect for understanding its behavior.
To conclude, while there is some noise in the automatically generated data we consider it to be of decent quality. Moreover, in our main paper we clearly demonstrate the usefulness of this data for training captioning models.
A.2 Text prompt to generate SAV-Caption-train
We use the following prompt together with our vision prompts to generate the pseudo-labels of our training set.
Describe the subject in the red contour in the following video. If the subject is a part of an object, please describe this part instead of the whole object. Please DO NOT DESCRIBE anything in the blurred background outside the red contour. First determine the subject’s category (CATEGORY), properties (PROPERTIES), action (ACTION), and then give a description in ONE sentence (DESCRIPTION) including category, properties, and action, etc.. Please use this FORMAT: ’The video shows a CATEGORY. The subject’s properties are PROPERTIES. The subject’s action is ACTION. DESCRIPTION.’. The DESCRIPTION starts with ’A/ An CATEGORY’ or ’A/ An PROPERTIES CATEGORY’ if it is grammarly more proper to put the properties before the category. The category, properties, motion and the descriptions should be consistent. PROPERTIES should be about the objects appearance (color, texture, size, material, shape), what it is wearing or a functional property (e.g. fast, sharp). Please always include interesting or unexpected properties. If there are multiple actions happening sequentially, connect them with ’then’, but do not include more than 3 actions. For static objects or parts, just say the ACTION is ’static’ and it is OK to not include ACTION in DESCRIPTION. Please DO NOT mention the red contour in the description. If the subject is a person, please avoid describing the person’s skin color and describe the person’s clothes color instead. You only need to describe the details that you are certain about. If you cannot perform the task or you are very uncertain, please say ‘I cannot perform the task for this video.’.
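Because the prompt requests a fixed output FORMAT, a response can be split into its structured fields with a regular expression. The sketch below is one possible way to do this and is not described in the paper: the function name and field keys are hypothetical, and it assumes the model follows the requested format closely, returning `None` for refusals or unparseable outputs.

```python
import re

REFUSAL = "I cannot perform the task for this video."

# Mirrors the FORMAT requested in the prompt; accepts straight or curly
# apostrophes and "a"/"an" before the category.
_RESPONSE = re.compile(
    r"The video shows an? (?P<category>.+?)\.\s*"
    r"The subject['’]s properties are (?P<properties>.+?)\.\s*"
    r"The subject['’]s action is (?P<action>.+?)\.\s*"
    r"(?P<description>\S.*)",
    re.DOTALL,
)


def parse_response(text):
    """Split a formatted response into category, properties, action, description."""
    text = text.strip()
    if REFUSAL in text:
        return None  # the model declined; no pseudo-label for this object
    match = _RESPONSE.search(text)
    if match is None:
        return None  # output did not follow the requested format
    return {key: value.strip() for key, value in match.groupdict().items()}


sample = (
    "The video shows a dog. The subject's properties are brown, small. "
    "The subject's action is running then jumping. "
    "A small brown dog runs then jumps."
)
parsed = parse_response(sample)
```

Keeping the category, properties and action as separate fields is what makes a per-element check such as the one in A.1 possible.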
Appendix B SAM2 baseline details
The original SAM2 Ravi et al. (2024) was trained on private datasets in addition to the publicly-released SAV training and validation set. The publicly-released SAM2 training code (https://github.com/facebookresearch/sam2) includes a finetuning pipeline on the MOSE dataset Ding et al. (2023b), but does not include the main training loop. Therefore, before adapting SAM2 to our use case, we attempt to reproduce SAM2 training in our JAX framework Bradbury et al. (2018); Dehghani et al. (2022). We also replace the MAE-pretrained backbone Hiera Ryali et al. (2023) with a more vision-language native backbone, EVA02 Fang et al. (2023). When using EVA02, we reduce the input resolution from the original 1024 to 512 to fit our hardware, and verified a minimal performance drop compared to the official SAM2 with Hiera-T and 1024 input size. We do not use the SA1B dataset Kirillov et al. (2023) for pretraining as we did not find it helpful on our target datasets. We adapt the training hyper-parameters in Table 12 (b) of the SAM2 paper, which we summarize in the accompanying hyper-parameter table.
As a result, our reproduced SAM2 with EVA02 Fang et al. (2023) and a smaller input size, trained on public data, closely matches the officially released model, as shown in the accompanying comparison table.
Appendix C Qualitative examples: VoCap vs SAM2 + Gemini Pseudo captioning
Tab. 4 demonstrates that VoCap outperforms the strong baseline of applying SAM2 followed by Gemini pseudo-labeling. Figure 4 illustrates why: The Gemini captions are often wrong for small objects (both examples). Furthermore, when a human (hand) or animal is near the object, Gemini 1.5 Pro sometimes describes this actor instead of the highlighted object (example on the left in Fig. 4).
Appendix D Training hyper-parameters
We include the full hyper-parameters used in our multi-task training phase in the accompanying hyper-parameter table.
References
- [1] Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022.
- [2] Bain et al. (2021) Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In CVPR, 2021.
- [3] Beery et al. (2020) Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, and Jonathan Huang. Context r-cnn: Long term temporal context for per-camera object detection. In CVPR, 2020.
- [4] Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
- [5] Caelles et al. (2019) Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. The 2019 davis challenge on vos: Unsupervised multi-object segmentation. arXiv:1905.00737, 2019.
- [6] Caesar et al. (2020) Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
- [7] Chai et al. (2023) Wenhao Chai, Xun Guo, Gaoang Wang, and Yan Lu. Stablevideo: Text-driven consistency-aware diffusion video editing. In CVPR, 2023.
- [8] Chen et al. (2022) Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv:2209.06794, 2022.
