MaskInversion: Localized Embeddings via Optimization of Explainability Maps
Walid Bousselham, Sofian Chaybouti, Christian Rupprecht, Vittorio Ferrari, Hilde Kuehne

TL;DR
MaskInversion introduces a novel optimization-based method to generate localized, context-aware image region embeddings using explainability maps from pre-trained vision-language models, enhancing tasks like retrieval and captioning.
Contribution
It presents MaskInversion, a new approach that refines a region embedding by optimizing it until its explainability map matches the query mask, applicable to any pre-trained foundation model without retraining.
Findings
Effective in localized image understanding tasks
Outperforms state-of-the-art methods on multiple datasets
Enables broad applications like captioning and image generation
Abstract
Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show some limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts with initializing an embedding token and compares its explainability map, derived from the foundation model, to the query mask. The embedding token is then refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, allowing to use…
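The refinement loop described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. The "explainability map" here is a hypothetical stand-in (a sigmoid of token-patch dot products), the discrepancy is assumed to be a mean squared error, and the names `explain_map`, `mask_loss`, and `invert_mask`, the zero initialization, and the step size are all chosen for illustration only. What it shares with the description above is the structure: the patch features stay fixed, and only the embedding token is updated by gradient descent.

```python
import math

def explain_map(token, patches):
    """Toy stand-in for an explainability map: one sigmoid score per patch."""
    return [1.0 / (1.0 + math.exp(-sum(t * p for t, p in zip(token, patch))))
            for patch in patches]

def mask_loss(heat, mask):
    """Mean squared discrepancy between the map and the binary query mask (assumed loss)."""
    return sum((h - m) ** 2 for h, m in zip(heat, mask)) / len(mask)

def invert_mask(patches, mask, init_token, steps=10, lr=1.0):
    """Refine only the token; the patch features (the frozen 'model') never change."""
    token = list(init_token)
    for _ in range(steps):
        heat = explain_map(token, patches)
        grad = [0.0] * len(token)
        for h, m, patch in zip(heat, mask, patches):
            # d(loss)/d(score) via the chain rule through the sigmoid
            coeff = 2.0 * (h - m) * h * (1.0 - h) / len(mask)
            for k, p in enumerate(patch):
                grad[k] += coeff * p
        token = [t - lr * g for t, g in zip(token, grad)]
    return token

# Hypothetical frozen features: two patches inside the mask, two outside.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mask = [1.0, 1.0, 0.0, 0.0]
init_token = [0.0, 0.0]

loss_before = mask_loss(explain_map(init_token, patches), mask)
token = invert_mask(patches, mask, init_token)
heat_after = explain_map(token, patches)
loss_after = mask_loss(heat_after, mask)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In a real setting the map would come from a gradient-based explainability method applied to the frozen foundation model, and the gradient would be obtained by automatic differentiation rather than by hand.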
Peer Reviews
Decision·ICLR 2026 Poster
(1) MaskInversion is training-free and can be used generally for region-based retrieval tasks. (2) The authors provide many ablation studies to validate the proposed method.
(1) In the paper, the authors claim MaskInversion can be used for any pre-trained model. Have the authors conducted experiments on any vision-language models such as LLaVA? (2) In Section 4.2, the authors mention that for optimization of MaskInversion, they set "10 gradient descent iterations". How was this number of iterations decided? (3) Have the authors tried other loss functions, such as a simple MSE loss?
1. Avoids the "domain gap" of input-modification methods (e.g., RedCircle, Masked Crop) and the "data hunger" of fine-tuning methods (e.g., AlphaCLIP). By keeping the foundation model frozen and only optimizing an embedding token, MaskInversion retains the pretrained model’s knowledge while enabling task-agnostic localization. The gradient decomposition strategy reduces the computational cost for multiple masks. This addresses a practical bottleneck of gradient-based explainability (expensive se
1. Upscaling increases the number of visual tokens (n), which affects the gradient decomposition’s efficiency (Equation 5 depends on n). The paper’s runtime ablation (Table 5) uses standard resolutions but not high-res inputs, leaving a gap in practicality for high-detail tasks (e.g., medical imaging).
2. For images with ≥5 objects, does MaskInversion’s performance degrade (e.g., due to cross-mask interference)? The current class retrieval and referring expression tasks focus on single masks, no
* Novel and Elegant Method: The core idea of optimizing a new embedding by forcing its explainability map to match a target region is very clever. It's an elegant way to "invert" the model's attribution mechanism to achieve spatial control.
* Training-Free and Practical: The method works entirely at test time and keeps the powerful foundation model (like CLIP) completely frozen. This makes it extremely practical, versatile, and broadly applicable to any model that can produce a differentiable ex
The original CLIP [CLS] token is powerful because it is aligned with text in a massive, open-vocabulary embedding space. It is not fully clear if the MaskInversion optimization, which pulls the token towards a spatial objective, fully preserves (or ideally, enhances) this fine-grained semantic alignment for general open-vocabulary classification. The regularization term ($L_{reg}$) helps, but this is a trade-off.
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging · AI in cancer detection · Machine Learning in Healthcare
Methods: Contrastive Language-Image Pre-training
