MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation
Yuanbing Zhu, Bingke Zhu, Yingying Chen, Yunfang Niu, Ming Tang, Jinqiao Wang

TL;DR
MROVSeg introduces a multi-resolution training framework for open-vocabulary image segmentation using a single CLIP backbone, effectively capturing fine details without high computational costs, and achieves state-of-the-art results.
Contribution
It proposes a novel multi-resolution training method with a Multi-Res Adapter and Masked Attention to improve segmentation detail and accuracy using only one pretrained CLIP model.
Findings
Outperforms existing methods on standard benchmarks.
Effectively captures fine details in high-resolution images.
Reduces computational overhead compared to multi-backbone approaches.
Abstract
Pretrained vision-language models (VLMs), e.g., CLIP, are increasingly used to bridge the gap between open- and closed-vocabulary recognition in open-vocabulary image segmentation. As VLMs are generally pretrained with low-resolution images, most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. A typical solution is to employ additional image backbones for high-resolution inputs, but this also introduces significant computation overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone, which uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which…
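The sliding-window slicing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `slice_into_windows`, the 224-pixel window, the non-overlapping stride, and the zero padding at the borders are all assumptions chosen for illustration.

```python
import math
import numpy as np

def slice_into_windows(image, window=224, stride=224):
    """Slice an (H, W, C) image into equally sized window x window crops.

    Hypothetical helper: window=224 and stride=224 are assumed values,
    not taken from the paper. The bottom and right edges are zero-padded
    so that every crop matches the encoder input size.
    """
    height, width, _ = image.shape
    # Number of window positions needed to cover each axis (at least one).
    n_rows = math.ceil(max(height - window, 0) / stride) + 1
    n_cols = math.ceil(max(width - window, 0) / stride) + 1
    # Padding required so the last window fits entirely inside the image.
    pad_h = (n_rows - 1) * stride + window - height
    pad_w = (n_cols - 1) * stride + window - width
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))
    crops = [
        padded[r * stride : r * stride + window,
               c * stride : c * stride + window]
        for r in range(n_rows)
        for c in range(n_cols)
    ]
    # Result shape: (n_rows * n_cols, window, window, C), in row-major order.
    return np.stack(crops), (n_rows, n_cols)

# Usage: a 448 x 672 image yields a 2 x 3 grid of 224 x 224 crops.
image = np.arange(448 * 672 * 3, dtype=np.float32).reshape(448, 672, 3)
crops, grid = slice_into_windows(image)
```

Each crop can then be passed through the same image encoder as one batch, and the returned grid shape records how to place the per-window features back into their spatial positions.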
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
Methods: Softmax · Attention Is All You Need · High-resolution input · Adapter · Contrastive Language-Image Pre-training
