AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation
Tongfei Chen, Shuo Yang, Yuguang Yang, Linlin Yang, Runtang Guo, Changbai Li, He Long, Chunyu Xie, Dawei Leng, Baochang Zhang

TL;DR
This paper introduces AML, a training strategy for Referring Image Segmentation that improves alignment between visual and textual data by filtering unreliable pixels, leading to state-of-the-art results without architectural changes.
Contribution
The paper proposes AML, a novel alignment-aware masked learning method that enhances RIS training by explicitly estimating pixel-level vision-language alignment and filtering out poorly aligned pixels, improving accuracy and robustness.
Findings
Achieves state-of-the-art results on RefCOCO datasets.
Enhances model robustness to diverse descriptions.
No additional inference overhead required.
Abstract
Referring Image Segmentation (RIS) aims to segment the object in an image uniquely referred to by a natural language expression. However, RIS training often contains hard-to-align and instance-specific visual signals; optimizing on such pixels injects misleading gradients and drives the model in the wrong direction. By explicitly estimating pixel-level vision-language alignment, the learner can suppress low-alignment regions, concentrate on reliable cues, and acquire more generalizable alignment features. In this paper, we propose Alignment-Aware Masked Learning (AML), a simple yet effective training strategy that quantifies region-referent alignment (PMME) and filters out unreliable pixels during optimization (AFM). Specifically, each sample first computes a similarity map between visual and textual features, and then masks out pixels falling below an adaptive similarity threshold,…
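The masking step described in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration, not the paper's formulation: the function names (`similarity_map`, `alignment_mask`, `masked_bce`) are hypothetical, the similarity is taken as the maximum cosine similarity over text tokens, the adaptive threshold is taken as mean minus `k` standard deviations, and the mask is applied to a per-pixel loss as one possible reading of "filters out unreliable pixels during optimization".

```python
import numpy as np

def similarity_map(visual, text):
    """visual: (H, W, D) patch features; text: (T, D) token features.
    Returns an (H, W) map holding, per patch, the max cosine similarity
    over text tokens (an assumed stand-in for the paper's PMME score)."""
    v = visual / (np.linalg.norm(visual, axis=-1, keepdims=True) + 1e-8)
    t = text / (np.linalg.norm(text, axis=-1, keepdims=True) + 1e-8)
    sim = v @ t.T              # (H, W, T) cosine similarities
    return sim.max(axis=-1)    # (H, W)

def alignment_mask(sim, k=1.0):
    """Boolean keep-mask: True where similarity >= mean - k * std.
    The threshold rule is a hypothetical choice of 'adaptive' threshold."""
    thr = sim.mean() - k * sim.std()
    return sim >= thr

def masked_bce(logits, target, keep):
    """Binary cross-entropy averaged over kept pixels only."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-7
    loss = -(target * np.log(p + eps) + (1 - target) * np.log(1 - p + eps))
    keep = keep.astype(loss.dtype)
    return (loss * keep).sum() / max(keep.sum(), 1.0)

# Usage with random features (shapes are illustrative only).
rng = np.random.default_rng(0)
sim = similarity_map(rng.normal(size=(4, 5, 8)), rng.normal(size=(3, 8)))
keep = alignment_mask(sim)
loss = masked_bce(rng.normal(size=(4, 5)), rng.integers(0, 2, size=(4, 5)), keep)
```

Because the mask in this sketch only affects which pixels contribute to the training loss, nothing changes at inference time, which is consistent with the "no additional inference overhead" finding listed above.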
Peer Reviews
Decision·ICLR 2026 Poster
• The projection design provides a novel approach for measuring similarity between representations of different modalities. This method can be extended to more tasks.
• Experiments demonstrate the effectiveness of the proposed structure, achieving competitive results across multiple downstream datasets.
• The proposed structure does not significantly increase training overhead while maintaining inference time.
• The proposed idea is interesting and generally well-motivated, and the experimental
My primary concern is the method’s sensitivity to small or low-contrast objects. The AML framework relies on PMME to generate alignment-based masks by identifying high-confidence visual patches. This mechanism inherently depends on the relative distribution of features within the image. As a result, small objects or objects with low visual saliency may produce low peak alignment scores and be incorrectly masked out during training. Consequently, the model’s performance may degrade on images wher
(1) The motivation for using the proposed alignment-aware masked learning approach for referring image segmentation is well presented. (2) The explanations and illustrations of the PatchMax Matching Evaluation, the alignment-aware filtering mask, and the training strategy are mostly clear and intuitive.
(1) Using a previous-step inference to predict the mask and guide the current learning step may cause convergence issues. The initial mask is largely incorrect and can result in unexpected learning curves, yet this issue is not discussed. (2) On the fairness of the experimental comparison: since CARIS+AML uses 17.2% more training time than CARIS (according to Appendix G.2), the performance gain in Table 1 may also come from longer training. There is no abla
- This work proposes Alignment-Aware Masked Learning (AML), a training strategy that quantifies region–referent alignment (PMME) and filters out unreliable pixels during optimization (AFM), and validates that it improves RIS performance.
- The writing is good overall, and the proposed framework is easy to understand.
- The experimental analysis is detailed enough for readers to appreciate the benefits of AML.
- Motivation clarification: the motivation is not well clarified by Figure 1. I suppose the authors' motivation is that a number of regions (especially background regions) dominate the training loss.
- Method contributions: based on the above motivation, I am more inclined to believe that this work is actually an implementation of curriculum learning in the RIS task. This is also supported by the efficiency of the early training stage. In view of the originality, it decreases the contribut
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
