RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios
Tianyi Zhao, Jiawen Xi, Linhui Xiao, Junnan Li, Xue Yang, Maoxun Yuan, Xingxing Wei

TL;DR
This paper introduces RGBT-Ground, a large-scale benchmark for visual grounding on paired RGB and thermal images in complex real-world scenarios, and proposes a new model, RGBT-VGNet, for robust multi-modal grounding.
Contribution
It presents the first comprehensive benchmark for visual grounding in diverse real-world conditions and introduces a multi-modal framework and baseline model for improved robustness.
Findings
RGBT-VGNet outperforms existing methods adapted to the benchmark.
The improvements are significant in nighttime and long-distance scenarios.
The benchmark enables evaluation under complex environmental conditions.
Abstract
Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination, weather, etc., that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene,…
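The data layout described in the abstract can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper or its released code: the `GroundingSample` class, its field names, the corner-coordinate box format, and the use of intersection-over-union (IoU) to compare a predicted box with the annotated one. The `annotations` field is a generic placeholder, since the abstract is truncated before it lists the annotation levels.

```python
from dataclasses import dataclass, field

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = tuple[float, float, float, float]


@dataclass
class GroundingSample:
    """Hypothetical record for one aligned RGB/TIR pair with a referring expression."""
    rgb_path: str      # path to the RGB image (illustrative name)
    tir_path: str      # path to the spatially aligned thermal image
    expression: str    # natural language referring expression
    box: Box           # annotated bounding box of the referred object
    # Placeholder for the fine-grained annotations; keys are not specified here.
    annotations: dict[str, str] = field(default_factory=dict)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


sample = GroundingSample(
    rgb_path="rgb/000001.jpg",   # made-up paths and text, for illustration only
    tir_path="tir/000001.jpg",
    expression="the person standing next to the parked car",
    box=(0.0, 0.0, 10.0, 10.0),
)
predicted: Box = (5.0, 0.0, 15.0, 10.0)
overlap = iou(sample.box, predicted)
```

Because the two images are spatially aligned, a single box per expression can serve both modalities in this sketch; whether the benchmark stores annotations this way is not stated in the text above.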
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
