PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning

Zekai Lin; Xu Zheng

arXiv:2602.21992·cs.CV·February 26, 2026

TL;DR

This paper introduces PanoEnv, a large-scale panoramic VQA benchmark and a reinforcement learning framework that significantly improves 3D spatial reasoning in vision-language models for panoramic environments.

Contribution

The paper presents a new benchmark for 3D reasoning in panoramic images and an RL-based post-training method that enhances models' 3D spatial understanding beyond existing approaches.

Findings

01

Benchmarking 14 state-of-the-art VLMs shows limited 3D understanding: only 49.34% overall accuracy and 8.36% on open-ended questions.

02

The proposed RL framework improves accuracy to over 52%.

03

The 7B model surpasses larger models in semantic evaluation scores.

Abstract

360° panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided…
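The abstract names GRPO as the basis of the post-training framework. As a rough illustration of the group-relative idea only, the sketch below normalizes the rewards of several sampled answers to the same question against their group mean and standard deviation. The function name, the example reward values, the `eps` constant, and the use of population standard deviation are all assumptions made for illustration; the paper's actual reward design is not described in the text above and is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of answers sampled for the same question.

    Each advantage is (reward - group mean) / (group std + eps), so answers
    are scored relative to their siblings rather than by a learned critic.
    Population std is an assumption; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one panoramic question.
rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Answers that score above the group average receive a positive advantage and those below receive a negative one; if every answer in a group gets the same reward, all advantages are zero and that group contributes no learning signal.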

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization