Dense360: Dense Understanding from Omnidirectional Panoramas
Yikang Zhou, Tao Zhang, Dizhe Zhang, Shunping Ji, Xiangtai Li, Lu Qi

TL;DR
This paper introduces Dense360, a dataset and benchmark for dense visual understanding from omnidirectional panoramas, together with ERP-RoPE, a position encoding scheme aimed at the challenges that equirectangular projection (ERP) poses for multimodal large language models.
Contribution
It provides the first large-scale panoramic dataset with dense annotations; a novel position encoding scheme, ERP-RoPE; and a benchmark, Dense360-Bench, for evaluating panoramic visual-language understanding.
Findings
Created a dataset with 160K panoramas and 5M dense captions
Developed ERP-RoPE encoding scheme for panoramic images
Established Dense360-Bench for evaluating panoramic understanding
Abstract
Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degrees), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce an omnidirectional panorama dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas can provide more complete, compact, and continuous scene representations through equirectangular projections (ERP). However, the use of ERP introduces two key challenges for MLLMs: i) spatial…
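To make the equirectangular projection mentioned in the abstract concrete, here is a minimal sketch of the standard ERP pixel-to-sphere mapping. It is a generic illustration of ERP geometry, not the paper's ERP-RoPE scheme; the function names, the pixel-centre convention, and the axis orientation are assumptions chosen for this example.

```python
import math

def erp_pixel_to_lonlat(u, v, width, height):
    """Map an ERP pixel (u, v) to (longitude, latitude) in radians.

    Assumed convention: pixel centres sit at +0.5, longitude spans
    [-pi, pi) left to right, latitude spans +pi/2 (top) to -pi/2 (bottom).
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    return lon, lat

def lonlat_to_unit_vector(lon, lat):
    """Convert spherical angles to a 3D unit vector (y axis points up)."""
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return x, y, z

def angular_distance(p, q, width, height):
    """Great-circle angle (radians) between two ERP pixels p and q."""
    a = lonlat_to_unit_vector(*erp_pixel_to_lonlat(p[0], p[1], width, height))
    b = lonlat_to_unit_vector(*erp_pixel_to_lonlat(q[0], q[1], width, height))
    dot = sum(i * j for i, j in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.acos(dot)

if __name__ == "__main__":
    W, H = 2048, 1024
    # Leftmost and rightmost pixels of the same row are neighbours on the sphere.
    print(angular_distance((0, 512), (W - 1, 512), W, H))
    # Pixels half an image width apart face opposite directions.
    print(angular_distance((0, 512), (W // 2, 512), W, H))
```

In this sketch, the leftmost and rightmost columns of a row are as close on the sphere as two horizontally adjacent pixels, even though they are a full image width apart in 2D pixel coordinates.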
Taxonomy
Topics: Advanced Vision and Imaging
