Dense360: Dense Understanding from Omnidirectional Panoramas

Yikang Zhou; Tao Zhang; Dizhe Zhang; Shunping Ji; Xiangtai Li; Lu Qi

arXiv:2506.14471 · cs.CV · June 18, 2025

TL;DR

This paper introduces Dense360, a dataset and benchmark for dense visual understanding from omnidirectional panoramas. It targets the challenges that equirectangular projection (ERP) poses for multimodal large language models (MLLMs), including how panoramic images are position-encoded.

Contribution

It provides the first large-scale panoramic dataset with dense annotations; a novel position encoding scheme, ERP-RoPE; and a benchmark, Dense360-Bench, for evaluating panoramic vision-language understanding.

Findings

01. Created a dataset with 160K panoramas and 5M dense captions

02. Developed ERP-RoPE encoding scheme for panoramic images

03. Established Dense360-Bench for evaluating panoramic understanding

Abstract

Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degrees), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce an omnidirectional panorama dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas can provide more complete, compact, and continuous scene representations through equirectangular projections (ERP). However, the use of ERP introduces two key challenges for MLLMs: i) spatial…
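For readers unfamiliar with equirectangular projection, the sketch below shows the standard mapping from an ERP pixel to a direction on the unit sphere. It is a generic illustration and not the paper's ERP-RoPE scheme, whose details are not given on this page. The function names, the pixel-centre convention, and the 2048x1024 image size are assumptions made for the example.

```python
import math

def erp_pixel_to_lonlat(u, v, width, height):
    """Map an ERP pixel (u, v) to (longitude, latitude) in radians.

    Assumed convention: pixel centres at +0.5, longitude in [-pi, pi)
    growing to the right, latitude in [-pi/2, pi/2] growing upward.
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    return lon, lat

def lonlat_to_unit_vector(lon, lat):
    """Convert spherical angles to a 3D unit direction vector."""
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def angular_distance(p, q, width, height):
    """Great-circle angle (radians) between two ERP pixels p and q."""
    a = lonlat_to_unit_vector(*erp_pixel_to_lonlat(p[0], p[1], width, height))
    b = lonlat_to_unit_vector(*erp_pixel_to_lonlat(q[0], q[1], width, height))
    dot = sum(x * y for x, y in zip(a, b))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot)))

# Hypothetical 2048x1024 panorama, one row near the equator.
W, H = 2048, 1024
edge_gap = angular_distance((0, 512), (W - 1, 512), W, H)
half_gap = angular_distance((0, 512), (W // 2, 512), W, H)
print(f"left edge vs right edge: {edge_gap:.4f} rad")
print(f"left edge vs image centre: {half_gap:.4f} rad")
```

In this sketch the leftmost and rightmost pixels of a row are far apart in image coordinates yet almost coincide on the sphere, while the leftmost and centre pixels point in nearly opposite directions. Pixel distance in an ERP image is therefore a poor proxy for angular distance in the scene.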

Taxonomy

Topics: Advanced Vision and Imaging