ACTRESS: Active Retraining for Semi-supervised Visual Grounding
Weitai Kang, Mengxue Qu, Yunchao Wei, Yan Yan

TL;DR
This paper introduces ACTRESS, an active retraining framework for semi-supervised visual grounding. It improves model performance through selective pseudo-labeling and periodic retraining, and it is designed to be compatible with Transformer-based grounding models, which the earlier RefTeacher approach is not.
Contribution
The paper proposes ACTRESS, a new framework that incorporates detection confidence, active sampling, and selective retraining to enhance semi-supervised visual grounding models.
Findings
Superior performance on benchmark datasets
Effective pseudo-label selection via Faithfulness, Robustness, and Confidence
Enhanced model robustness through periodic retraining
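The pseudo-label selection named above can be illustrated with a minimal sketch. This summary does not define how Faithfulness, Robustness, and Confidence are computed or combined, so everything here is an assumption for illustration: the `PseudoLabel` record, the idea that each score is a scalar in [0, 1], the `select_pseudo_labels` helper, the threshold values, and the rule that a pseudo-label must pass all three thresholds. It should not be read as the authors' selection procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PseudoLabel:
    # Hypothetical record: one pseudo-labeled sample with three quality scores.
    sample_id: str
    faithfulness: float  # assumed to lie in [0, 1]
    robustness: float    # assumed to lie in [0, 1]
    confidence: float    # assumed to lie in [0, 1]


def select_pseudo_labels(
    candidates: List[PseudoLabel],
    min_faithfulness: float = 0.5,  # illustrative threshold, not from the paper
    min_robustness: float = 0.5,    # illustrative threshold, not from the paper
    min_confidence: float = 0.5,    # illustrative threshold, not from the paper
) -> List[PseudoLabel]:
    """Keep only candidates that pass all three thresholds (an assumed rule)."""
    return [
        c
        for c in candidates
        if c.faithfulness >= min_faithfulness
        and c.robustness >= min_robustness
        and c.confidence >= min_confidence
    ]


candidates = [
    PseudoLabel("img_001", faithfulness=0.9, robustness=0.8, confidence=0.7),
    PseudoLabel("img_002", faithfulness=0.9, robustness=0.2, confidence=0.9),
]
selected = select_pseudo_labels(candidates)
print([c.sample_id for c in selected])
```

In this toy input only `img_001` clears every threshold; `img_002` is dropped because of its low robustness score. Under a periodic retraining scheme, the selected subset would then be used as additional training data for the next round.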
Abstract
Semi-Supervised Visual Grounding (SSVG) is a new challenge for its sparse labeled data with the need for multimodal understanding. A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision. However, this approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline. These pipelines directly regress results without region proposals or foreground binary classification, rendering them unsuitable for fitting in RefTeacher due to the absence of confidence scores. Furthermore, the geometric difference in teacher and student inputs, stemming from different data augmentations, induces natural misalignment in attention-based constraints. To establish a compatible SSVG framework, our paper proposes the…
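The geometric misalignment mentioned in the abstract can be made concrete with a small sketch. It assumes, purely for illustration, that the student view is a horizontally flipped copy of the teacher view and that regions are axis-aligned boxes in `(x1, y1, x2, y2)` pixel format; the abstract does not say which augmentations are used, and the helpers `hflip_box` and `iou` are hypothetical names. The point is only that the same region occupies different coordinates in the two views, so any spatial signal compared position by position no longer lines up.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def hflip_box(box: Box, image_width: float) -> Box:
    """Map a box into the coordinates of a horizontally flipped image."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


image_width = 200.0
teacher_box = (10.0, 20.0, 50.0, 80.0)                # region in the teacher view
student_box = hflip_box(teacher_box, image_width)     # same region in the flipped view

# Compared naively in raw coordinates, the two boxes do not overlap at all.
naive_overlap = iou(teacher_box, student_box)

# Undoing the flip before comparing restores perfect agreement.
aligned_overlap = iou(teacher_box, hflip_box(student_box, image_width))
```

Here the object on the left of the teacher view lands on the right of the student view, so the naive overlap is zero, while mapping back through the inverse transform recovers the original box exactly.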
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
