Adaptive Masking Enhances Visual Grounding
Sen Jia, Lei Li

TL;DR
This paper introduces IMAGE, an adaptive masking technique inspired by cognitive science and masked autoencoders, to improve low-shot visual grounding without increasing dataset size; it outperforms baseline models on the COCO and ODinW benchmarks.
Contribution
Proposes IMAGE, an adaptive masking method that enhances low-shot visual grounding by focusing on salient regions, reducing reliance on large datasets.
Findings
Outperforms baseline models on COCO and ODinW datasets
Achieves better generalization in zero-shot and few-shot tasks
Demonstrates effectiveness of adaptive masking with Gaussian modeling
Abstract
In recent years, zero-shot and few-shot learning in visual grounding have garnered considerable attention, largely due to the success of large-scale vision-language pre-training on expansive datasets such as LAION-5B and DataComp-1B. However, the continuous expansion of these datasets presents significant challenges, particularly with respect to data availability and computational overhead, thus creating a bottleneck in the advancement of low-shot learning capabilities. In this paper, we propose IMAGE, Interpretative MAsking with Gaussian radiation modEling, aimed at enhancing vocabulary grounding in low-shot learning scenarios without necessitating an increase in dataset size. Drawing inspiration from cognitive science and the recent success of masked autoencoders (MAE), our method leverages adaptive masking on salient regions of the feature maps generated by the vision backbone. This…
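As a rough illustration of the idea in the abstract (masking salient regions of a backbone feature map, with a Gaussian weighting), here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how saliency is computed or how the Gaussian is applied. The saliency definition (channel-wise mean of absolute activations), the single isotropic Gaussian centred on the saliency peak, and the names `gaussian_saliency_mask`, `mask_ratio`, and `sigma` are all assumptions made for this example.

```python
import numpy as np


def gaussian_saliency_mask(feature_map, mask_ratio=0.3, sigma=1.5):
    """Zero out the most salient cells of a (C, H, W) feature map.

    Assumptions (hypothetical, not taken from the paper):
    - saliency = channel-wise mean of absolute activations;
    - masking priority = saliency weighted by an isotropic Gaussian
      centred on the saliency peak.
    Returns the masked feature map and the boolean keep-mask of shape (H, W).
    """
    _, h, w = feature_map.shape
    saliency = np.abs(feature_map).mean(axis=0)  # (H, W)
    peak_y, peak_x = np.unravel_index(np.argmax(saliency), saliency.shape)

    # Gaussian weight: 1 at the saliency peak, decaying with distance.
    ys, xs = np.mgrid[0:h, 0:w]
    dist_sq = (ys - peak_y) ** 2 + (xs - peak_x) ** 2
    gaussian = np.exp(-dist_sq / (2.0 * sigma ** 2))

    # Cells with the highest weighted saliency are masked first.
    score = saliency * gaussian
    num_masked = int(round(mask_ratio * h * w))
    keep = np.ones(h * w, dtype=bool)
    if num_masked > 0:
        masked_idx = np.argsort(score.ravel())[::-1][:num_masked]
        keep[masked_idx] = False
    keep = keep.reshape(h, w)

    # Broadcast the spatial mask over all channels.
    return feature_map * keep[None, :, :], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fmap = rng.normal(size=(8, 6, 6))
    masked, keep = gaussian_saliency_mask(fmap, mask_ratio=0.25)
    print("cells kept:", int(keep.sum()), "of", keep.size)
```

In this sketch `mask_ratio` controls what fraction of spatial cells is hidden and `sigma` controls how tightly the mask clusters around the most salient location; a learned or per-image adaptive choice of either would be a separate design decision.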
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The authors propose IMAGE, a method that, unlike conventional masking approaches, strategically obscures salient regions to encourage models to reason more effectively from partial observations.
The authors state that the training strategy of occluding salient targets draws on human perception. However, human perception also relies on the relationship between background and foreground to identify objects, and it remains unclear how this aspect was considered or modeled in the proposed approach. In the Adaptive Mask Generation Block, if the occlusion ratio is excessively high, could it lead the model to overfit certain features, thereby negatively affecting overall pe
1. The paper provides some theoretical derivations and offers theoretical verification. 2. Experiments demonstrate the superiority of the proposed method over random masking.
1. The Introduction contains logical leaps; in particular, it lacks clear explanations of key concepts and analysis of related work. (1) The core of low-shot learning lies in the scarcity of samples. How does this relate to missing features? (2) In low-shot/few-shot scenarios, why does the paper choose a masking-based approach for CLIP training? What are the main challenges faced by existing low-shot/few-shot CLIP methods? (3) The core idea of this work is the adaptive masking of salient visual
- Using saliency maps to guide model training is interesting and important, since it shows how interpretable results can be used to improve model performance or fix model errors. - The authors validated the effectiveness of the proposed method on several object detection tasks.
- Priors based on feature maps are often unreliable, especially in the ViT architecture. Although DINO v1 attention maps are more in line with human cognition, the attention maps of CLIP or DINO v2 are not, yet these models still perform well [1]. - The experiments in this paper appear to focus primarily on object detection tasks, using only the visual grounding paradigm of Grounding DINO. I suggest the authors supplement this with experiments on referring e
1. The paper is well-motivated. Masking salient regions instead of backgrounds is intuitive. 2. The paper provides a theoretical foundation, which makes it sound.
1. The paper focuses only on referring expression comprehension (REC) in visual grounding (VG) and pays no attention to referring expression generation (REG). 2. Too many related works and compared methods in visual grounding are missing, such as [1-6]. More detailed discussions and comparisons should be added. 3. In addition, more comprehensive benchmarks should be included, such as RefCOCO/+/g, Visual Genome and FineCops-Ref. 4. Meanwhile, most powerful VLMs (e.g., Qwen-VL, InternVL,
Taxonomy
Topics: Visual perception and processing mechanisms · Virtual Reality Applications and Impacts · Human-Automation Interaction and Safety
Methods: Softmax · Attention Is All You Need · L1 Regularization · Adaptive Masking
