ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation

Ziquan Liu; Zhewei Zhu; Xuyang Shi

arXiv:2512.24224 · cs.CV · January 1, 2026

TL;DR

This paper introduces ARM, a learnable module that enhances CLIP's internal features for open-vocabulary semantic segmentation, achieving better performance without extensive retraining or external models.

Contribution

ARM is a novel, lightweight, learnable module that adaptively refines CLIP features, enabling universal plug-and-play improvements for training-free OVSS frameworks.

Findings

01. Consistently improves baseline performance across multiple benchmarks.
02. Operates with negligible inference overhead.
03. Proves effective as a universal post-processor for training-free OVSS frameworks.

Abstract

Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a ``train…
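The attention pattern the abstract describes (shallow features as queries, deep features as keys and values, then a self-attention pass) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`attention`, `refine`), the residual additions, the single attention head, the absence of learned projection weights, and the random toy features are all assumptions made for illustration, and plain Python lists stand in for CLIP feature maps.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]           # transpose keys to (d x n_k)
    scores = matmul(q, k_t)                        # (n_q x n_k) similarity scores
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)                      # (n_q x d) weighted values

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def refine(shallow, deep):
    """Hypothetical refinement: cross-attention, then self-attention.

    Assumption: residual additions and identity projections; the paper's
    actual module is learnable and may differ in every one of these details.
    """
    # Cross-attention: each shallow token (Q) queries the deep tokens (K, V).
    fused = add(shallow, attention(shallow, deep, deep))
    # Self-attention over the fused tokens.
    return add(fused, attention(fused, fused, fused))

# Toy stand-ins for per-patch features from a shallow and a deep CLIP layer.
random.seed(0)
num_patches, dim = 4, 8
shallow = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
deep = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]

refined = refine(shallow, deep)
print(len(refined), len(refined[0]))  # one refined vector per patch: 4 8
```

The output keeps one vector per patch at the original feature width, which is what would let such a module sit between a CLIP backbone and an existing training-free segmentation head without changing either side.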

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Natural Language Processing Techniques