OpenView: Empowering MLLMs with Out-of-view VQA

Qixiang Chen; Cheng Zhang; Chi-Wing Fu; Jingwen Ye; Jianfei Cai

arXiv:2512.18563·cs.CV·December 23, 2025

TL;DR

This paper introduces OpenView, a pipeline, dataset, and benchmark for out-of-view visual question answering. It trains multimodal models to reason about content beyond the visible image frame and raises their average performance from 48.6% to 64.1%.

Contribution

The paper presents a four-stage VQA generation pipeline (OpenView), a synthetic dataset built from real-world panoramas (OpenView-Dataset), and a benchmark for out-of-view VQA (OpenView-Bench), advancing the ability of MLLMs to reason about content beyond the visible image frame.
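To make "out-of-view" concrete, the following is a minimal sketch of one way to test whether a direction in a panorama falls inside a perspective view framed from it. It assumes an ideal pinhole camera and a yaw/pitch convention chosen here for illustration; the function names `direction` and `is_in_view` are hypothetical and are not taken from the OpenView pipeline.

```python
import math


def direction(yaw_deg, pitch_deg):
    """Unit vector for a panorama direction (x right, y up, z forward).

    Assumed convention: yaw increases to the right, pitch increases upward.
    """
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
        math.cos(pitch) * math.cos(yaw),
    )


def is_in_view(obj_yaw, obj_pitch, cam_yaw, cam_pitch,
               hfov_deg=90.0, vfov_deg=90.0):
    """Return True if the object direction lies inside the camera frustum."""
    d = direction(obj_yaw, obj_pitch)

    # Orthonormal camera basis: forward, right, up.
    yaw, pitch = math.radians(cam_yaw), math.radians(cam_pitch)
    fwd = direction(cam_yaw, cam_pitch)
    right = (math.cos(yaw), 0.0, -math.sin(yaw))
    up = (
        -math.sin(pitch) * math.sin(yaw),
        math.cos(pitch),
        -math.sin(pitch) * math.cos(yaw),
    )

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    x, y, z = dot(d, right), dot(d, up), dot(d, fwd)
    if z <= 0.0:
        # The direction is at or behind the image plane: out of view.
        return False

    # Compare the projected offsets against the half-FOV tangents.
    return (
        abs(x / z) <= math.tan(math.radians(hfov_deg) / 2)
        and abs(y / z) <= math.tan(math.radians(vfov_deg) / 2)
    )
```

With a 90° field of view and the camera facing yaw 0, a direction 30° to the right is in view, while directions 60° to the right or directly behind the camera are out of view. A panorama records all of these directions, which is what allows questions about the out-of-view ones to be grounded in real image content.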

Findings

01. MLLM performance improved from 48.6% to 64.1% on average with OpenView.
02. The OpenView dataset enables effective supervised fine-tuning for out-of-view reasoning.
03. The OpenView benchmark provides a new standard for evaluating out-of-view VQA capabilities.

Abstract

Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet they perform well mainly when reasoning about in-view content within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason about objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline that generates multiple-choice VQA at scale, leveraging panoramic imagery to enable context-rich and spatially grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset built from diverse real-world panoramas, to empower MLLMs through supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable…
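As an illustration of what jointly measuring choice and rationale accuracy could look like, here is a small hedged sketch. The excerpt above does not say how OpenView-Bench combines the two, so this code assumes that each item carries a boolean rationale judgement and that an item counts as jointly correct only when both the choice and the rationale are correct. The `Prediction` type, its fields, and the `score` function are hypothetical names, not the benchmark's API.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    choice: str         # option the model picked, e.g. "B"
    answer: str         # ground-truth option
    rationale_ok: bool  # assumption: a judge's verdict on the rationale


def score(preds):
    """Return choice, rationale, and joint accuracy over a list of predictions."""
    n = len(preds)
    if n == 0:
        raise ValueError("no predictions to score")

    choice_hits = sum(p.choice == p.answer for p in preds)
    rationale_hits = sum(p.rationale_ok for p in preds)
    # Assumed joint rule: both the choice and the rationale must be correct.
    joint_hits = sum(p.choice == p.answer and p.rationale_ok for p in preds)

    return {
        "choice_acc": choice_hits / n,
        "rationale_acc": rationale_hits / n,
        "joint_acc": joint_hits / n,
    }
```

Reporting the three numbers side by side helps separate correct choices backed by a sound rationale from correct choices reached by guessing, which is one reading of what makes such an evaluation diagnosable.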

Code & Models

Datasets

7xiang/OpenView-Dataset
dataset · 39 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis