PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning
Zekai Lin, Xu Zheng

TL;DR
This paper introduces PanoEnv, a large-scale panoramic VQA benchmark and a reinforcement learning framework that significantly improves 3D spatial reasoning in vision-language models for panoramic environments.
Contribution
The paper presents a new benchmark for 3D reasoning in panoramic images and an RL-based training method that enhances models' 3D spatial understanding beyond existing approaches.
Findings
Benchmarked baseline VLMs show limited 3D understanding, reaching only 49.34% overall accuracy and 8.36% on open-ended questions.
The proposed RL framework improves accuracy to over 52%.
The 7B model surpasses larger models in semantic evaluation scores.
Abstract
360° panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided…
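For readers unfamiliar with GRPO, the core idea is that several answers are sampled for the same question and each answer's reward is normalized against the other answers in its group, so no separate value network is needed. The snippet below is a minimal sketch of that normalization step only, not the paper's implementation. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for illustration; the paper's actual reward design is not shown in the abstract text above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards: scalar rewards for several answers sampled for ONE question.
    eps: small constant (assumed value) that avoids division by zero
         when every answer in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one question,
# e.g. 1.0 if an answer matched the ground truth and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer earns the same reward yields zero advantages.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Answers that score above their group's mean receive a positive advantage and are reinforced, while below-average answers are pushed down. A group with identical rewards contributes no learning signal, which is one reason the choice of reward function matters in this style of training.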
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
