MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding

Panquan Yang; Junfei Huang; Zongzhangbao Yin; Yingsong Hu; Anni Xu; Xinyi Luo; Xueqi Sun; Hai Wu; Sheng Ao; Zhaoxing Zhu; Chenglu Wen; Cheng Wang

arXiv:2512.24605·cs.CV·January 1, 2026

Open Access

TL;DR

This paper introduces MoniRefer, a large-scale, real-world, multi-modal dataset for 3D visual grounding in outdoor traffic scenes, along with Moni3DVG, a novel end-to-end method for accurate object localization.

Contribution

It presents the first roadside-level 3D visual grounding dataset and a new multi-modal learning approach, advancing outdoor traffic scene understanding.

Findings

01. MoniRefer contains over 136,000 objects with natural language descriptions.

02. Moni3DVG outperforms existing methods in 3D object localization accuracy.

03. The dataset and method significantly improve outdoor traffic scene analysis.

Abstract

3D visual grounding aims to localize the object in a 3D point cloud scene that semantically corresponds to a given natural language sentence. It is critical for roadside infrastructure systems to interpret natural language and localize the relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on indoor scenes and outdoor driving scenes; outdoor monitoring scenarios remain unexplored because paired point cloud-text data captured by roadside infrastructure sensors is scarce. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The…

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization