Beyond Quantity: Distribution-Aware Labeling for Visual Grounding
Yichi Zhang, Gongwei Chen, Jun Zhu, Jia Wan, Liqiang Nie

TL;DR
This paper introduces DAL, a distribution-aware labeling framework for visual grounding that enhances data diversity and quality by expanding semantic coverage and filtering noise, leading to improved performance.
Contribution
The paper presents a novel distribution-aware labeling method that combines reliable pseudo-labeling with explicit out-of-distribution expansion for better visual grounding.
Findings
DAL outperforms strong baselines on three benchmarks.
Distribution-aware filtering improves data quality and training efficiency.
State-of-the-art results demonstrate the effectiveness of the approach.
Abstract
Visual grounding requires large and diverse region-text pairs. However, manual annotation is costly and fixed vocabularies restrict scalability and generalization. Existing pseudo-labeling pipelines often overfit to biased distributions and generate noisy or redundant samples. Through our systematic analysis of data quality and distributional coverage, we find that performance gains come less from raw data volume and more from effective distribution expansion. Motivated by this insight, we propose DAL, a distribution-aware labeling framework for visual grounding. The proposed method first employs a dual-driven annotation module, where a closed-set path provides reliable pseudo labels and an open-set path enriches vocabulary and introduces novel concepts; meanwhile, it further performs explicit out-of-distribution (OOD) expression expansion to broaden semantic coverage. We then propose a…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
See Questions.
1. Experiments on three tasks, including Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES) tasks, show that the visual grounding performance can benefit from the generated pseudo-data, and also demonstrate the generalization ability of the proposed data augmentation method. 2. This paper is well written. The proposed dual-driven annotation and the data filtering operations are well described and easy to fo
1. One of the major weaknesses lies in the novelty of the data generation and filtering strategies. As a main contribution of this paper, the open-set and closed-set annotation operations follow a standard practice. Although the authors introduce an additional OOD expression expansion operation, the improvements it brings are relatively small (see Table 4, by introducing 90K extra data, the performance of “+ OOD expansion” is only slightly better than that of the dual-driven strategy). How about
This review evaluates the paper's quality based on the following criteria: task relevance, related work, technical novelty, technical correctness, experimental validation, writing and presentation, and reproducibility. Each aspect is discussed and highlighted as a strength or a weakness in the sections below. - **Relevance of the task:** Visual Grounding of Referring Expressions is a highly relevant problem for the ICLR community. This paper presents state-of-the-art results on benchmark data
- **Reproducibility and Implementation Details:** It is not indicated whether the source code will be released, and it's not included as part of the submission. - **Related Work and Technical Novelty:** The Related Work section does not adequately contextualize the contributions. It is not clear how the proposed method addresses the limitations of current pseudo-label generation methods for visual grounding. Specifically, how does the proposed method eliminate the need for human-labeled te
The paper persuasively shows that gains come from coverage and imbalance correction, not raw data volume, supported by distribution visualizations (caption types, subset features); its pipeline—GMM-guided OOD expansion + DPO and a two-stage filter (IoU + CLIP semantics with a density band-pass)—is reproducible and well aligned with the goal, and experiments report consistent improvements across REC/RES/GRES with modern backbones, with ablations disentangling the effects of annotation strategy, f
1. In the ablation of $\tau_{\text{semantic}}$ (Fig. 6), it is unclear what “data scale” means (images, regions per image, or captions per region); please define it and expand the sweep to a wider hyperparameter range, including interactions with $\tau_{\text{spatial}}$ and GMM K, to provide sensitivity curves. 2. In Fig. 4, the distribution plots do not explain what each point represents, which is confusing; please add a legend, annotate axes and units, specify whether points are images/regions
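The two-stage filter mentioned in the reviews (an IoU check, then a CLIP-based semantic check) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sample dictionary keys, the default threshold values, and the choice to compare each pseudo-labeled box against a second reference box are all hypothetical, and plain cosine similarity over precomputed vectors stands in for CLIP embeddings. The density band-pass step is not covered here.

```python
import math


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def two_stage_filter(samples, tau_spatial=0.5, tau_semantic=0.25):
    """Keep samples that pass both the spatial and the semantic check.

    Each sample is a dict with hypothetical keys:
      box, ref_box         -> (x1, y1, x2, y2) boxes to compare (assumption)
      region_emb, text_emb -> precomputed embeddings (stand-ins for CLIP)
    The default thresholds are illustrative, not values from the paper.
    """
    kept = []
    for s in samples:
        # Stage 1: spatial consistency between the two boxes.
        if iou(s["box"], s["ref_box"]) < tau_spatial:
            continue
        # Stage 2: region-text semantic agreement.
        if cosine(s["region_emb"], s["text_emb"]) < tau_semantic:
            continue
        kept.append(s)
    return kept
```

Running the spatial check first is a design choice made for this sketch: box overlap is cheap to compute, so samples that fail it never reach the embedding comparison.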
Taxonomy
Topics: Data Visualization and Analytics · Image Retrieval and Classification Techniques · Video Analysis and Summarization
Methods: Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Byte Pair Encoding · Dropout
