MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

Yuanbing Zhu; Bingke Zhu; Yingying Chen; Yunfang Niu; Ming Tang; Jinqiao Wang

arXiv:2408.14776 · cs.CV · November 28, 2024

Open Access

TL;DR

MROVSeg introduces a multi-resolution training framework for open-vocabulary image segmentation using a single CLIP backbone, effectively capturing fine details without high computational costs, and achieves state-of-the-art results.

Contribution

It proposes a novel multi-resolution training method with a Multi-Res Adapter and Masked Attention to improve segmentation detail and accuracy using only one pretrained CLIP model.

Findings

01. Outperforms existing methods on standard benchmarks.

02. Effectively captures fine details in high-resolution images.

03. Reduces computational overhead compared to multi-backbone approaches.

Abstract

Pretrained vision-language models (VLMs), e.g., CLIP, are increasingly used to bridge the gap between open- and closed-vocabulary recognition in open-vocabulary image segmentation. As VLMs are generally pretrained with low-resolution images (e.g., $224 \times 224$), most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. A typical solution is to employ additional image backbones for high-resolution inputs, but this also introduces significant computational overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone, which uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which…
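The sliding-window slicing mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `slice_into_windows`, the non-overlapping default stride, and the zero-padding of the bottom and right edges are all assumptions made here for illustration. The abstract states only that the high-resolution input is sliced into uniform patches that match the encoder's input size (e.g., $224 \times 224$).

```python
import numpy as np


def slice_into_windows(image, window=224, stride=224):
    """Slice an H x W x C image into uniform window x window crops.

    Hypothetical helper: zero-pads the bottom and right edges (an
    assumption, not taken from the paper) so every pixel is covered.
    Returns the stacked crops and the top-left (y, x) offset of each crop.
    """
    h, w = image.shape[:2]

    # Number of window positions per axis: ceil((size - window) / stride) + 1,
    # with at least one window for images smaller than the window itself.
    n_y = max(1, -(-(h - window) // stride) + 1)
    n_x = max(1, -(-(w - window) // stride) + 1)

    # Pad so the last window in each row and column fits entirely.
    pad_h = (n_y - 1) * stride + window - h
    pad_w = (n_x - 1) * stride + window - w
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))

    crops, positions = [], []
    for iy in range(n_y):
        for ix in range(n_x):
            y, x = iy * stride, ix * stride
            crops.append(padded[y:y + window, x:x + window])
            positions.append((y, x))

    # Shape: (num_windows, window, window, C)
    return np.stack(crops), positions


if __name__ == "__main__":
    high_res = np.zeros((448, 672, 3), dtype=np.float32)
    crops, positions = slice_into_windows(high_res)
    print(crops.shape, positions)
```

Each crop has the spatial size the pretrained encoder expects, so in a setup like this one the crops could be batched through a single backbone. How the per-window features are then merged is the job of components such as the Multi-Res Adapter, which this sketch does not attempt to model.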

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling

Methods: Softmax · Attention Is All You Need · High-resolution input · Adapter · Contrastive Language-Image Pre-training