ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
Anurag Bagchi, Zhipeng Bao, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert

TL;DR
REM is a video segmentation framework that fine-tunes video diffusion models on small-scale referring segmentation datasets, using their learned visual-language mapping to segment a wide range of objects and dynamic concepts, including rare and unseen categories.
Contribution
REM introduces a fine-tuning approach that preserves the generative model's entire architecture and changes only its objective, from predicting noise to predicting mask latents, which enables generalization to unseen objects and dynamic concepts in video segmentation.
Findings
Performs on par with state-of-the-art in-domain
Outperforms by up to 12 IoU points out-of-domain
Generalizes to non-object dynamic concepts
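As a point of reference for the "IoU points" figure above, the following is a minimal sketch of intersection-over-union between a predicted and a ground-truth binary mask. The function name `mask_iou`, the toy masks, and the empty-mask convention are assumptions made for illustration; how the paper averages scores over frames and videos is not stated in this summary.

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


# Toy 2x4 masks (hypothetical data, not from the paper).
gt = [[0, 1, 1, 0],
      [0, 1, 1, 0]]
pred = [[0, 1, 1, 1],
        [0, 1, 0, 0]]

score = mask_iou(pred, gt)   # 3 overlapping pixels / 5 pixels in the union
iou_points = 100.0 * score   # "IoU points" read here as IoU on a 0-100 scale
```

Under this reading, a gap of 12 IoU points corresponds to a difference of 0.12 in the ratio computed above.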
Abstract
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method leverages the universal visual-language mapping learned by video diffusion models on Internet-scale data by fine-tuning them on small-scale Referring Object Segmentation datasets. Our key insight is to preserve the entirety of the generative model's architecture by shifting its objective from predicting noise to predicting mask latents. The resulting model can accurately segment rare and unseen objects, despite only being trained on a limited set of categories. Additionally, it can effortlessly generalize to non-object dynamic concepts, such as smoke or raindrops, as demonstrated in our new benchmark for Referring Video Process Segmentation (Ref-VPS). REM performs on par with the state-of-the-art on in-domain datasets, like Ref-DAVIS, while…
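The change of objective described in the abstract can be sketched as two loss functions that share one model. This is a toy illustration, not the authors' implementation: `toy_model`, the flat-list "latents", the mean-squared-error loss, and the choice to feed the clean video latent to the model in the mask objective are all assumptions, since the summary only states that the target changes from noise to mask latents.

```python
import math
import random


def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def noise_prediction_loss(model, video_latent, text_emb, alpha_bar, rng):
    """Standard diffusion-style objective: corrupt the latent, regress the added noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in video_latent]
    noisy = [math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
             for z, e in zip(video_latent, noise)]
    return mse(model(noisy, text_emb), noise)


def mask_latent_loss(model, video_latent, text_emb, mask_latent):
    """Repurposed objective: the same model is called, but the target is a mask latent."""
    return mse(model(video_latent, text_emb), mask_latent)


def toy_model(latent, text_emb):
    # Hypothetical stand-in for the text-conditioned denoising network.
    scale = sum(text_emb) / len(text_emb)
    return [scale * z for z in latent]


# Hypothetical toy data.
rng = random.Random(0)
video_latent = [0.5, -1.0, 2.0, 0.0]
text_emb = [1.0, 1.0]
mask_latent = [1.0, -1.0, 2.0, 0.0]

pretrain_loss = noise_prediction_loss(toy_model, video_latent, text_emb, 0.5, rng)
finetune_loss = mask_latent_loss(toy_model, video_latent, text_emb, mask_latent)
```

The point of the sketch is that `model` is passed unchanged to both functions; only the regression target differs, which mirrors the abstract's claim that the architecture is preserved in its entirety.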
Peer Reviews
Decision · Submitted to ICLR 2025
* REM achieves good generalization to out-of-domain concepts and non-object entities by leveraging Internet-scale pretraining on video-language data, outperforming specialized methods.
* The Ref-VPS benchmark provides a way to evaluate models on dynamic processes, filling a gap in existing RVS datasets by focusing on temporal events and continuous appearance changes.
* The REM method performs comparably to MUTR and only slightly better than VLMO on Ref-YTB.
* Given the use of a high-capacity diffusion model, there is limited discussion of its computational efficiency or inference speed. A comparison of REM's computational requirements with prior methods would be beneficial.
* An analysis of the impact of the quantity and quality of synthetic data would be useful, in particular whether reducing or increasing the amount of synthetic data affects REM's performance.
* Since REM is
The paper tackles an interesting problem in video segmentation. The approach is straightforward and well-motivated, leveraging existing diffusion models. The contributions are clearly articulated, and the paper is well-structured. The authors commit to releasing code, models, and data, which supports reproducibility of their results by the community.
Evaluation of Ref-VPS: Dynamic concepts such as a wave or a cloud of smoke don't have clearly defined edges. Providing a single ground-truth contour for them is necessarily arbitrary. How then to evaluate? For example, in Figure 1, the mask for “the smoke dissipating” arguably does not cover all the smoke. Similarly, for “the wave crashing in the ocean”, the ground-truth mask does not include the wave foam on the left. A model prediction that includes these elements would be unjustly penalised. Segme
1. REM has a straightforward design for training and inference that is presented clearly. In general the writing is clear.
2. REM results are strong on Ref-VPS and Ref-YTB, and the ablation study effectively validates the authors' claim that retaining the architecture of diffusion models is important for taking advantage of their generalizable representations for referring video segmentation. The finding that improving video generation models leads to enhanced video segmentation performance is in
1. The motivation of the paper is not very clear. Specifically, it is not clear whether the authors are advocating for distinguishing the Referring Video Object Segmentation benchmarks from the Referring Video Process Segmentation benchmark that they propose. If the authors are advocating for this (and thus for two different tasks), then the final output of the model should be evaluated differently for each task, but this doesn't seem to be the case.
2. The Ref-VPS dataset was created from heavi
Taxonomy
Topics: Video Analysis and Summarization · Human Pose and Action Recognition
Methods: Dense Connections · Convolution · Q-Learning · Deep Q-Network · Diffusion · Sparse Evolutionary Training · Random Ensemble Mixture
