Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble
Lin Duan, Yanming Xiu, Maria Gorlatova

TL;DR
This paper evaluates the ability of state-of-the-art vision-language models to perceive and describe augmented reality scenes, highlighting their strengths in detecting obvious virtual objects and limitations with seamlessly integrated content.
Contribution
It introduces DiverseAR, a novel AR dataset, and systematically assesses VLMs' performance in AR scene analysis, revealing their potential and current limitations.
Findings
VLMs achieve up to 93% TPR in perception tasks
VLMs describe AR scenes with 71% accuracy
Performance is affected by virtual content placement and rendering quality
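The TPR figures above can be read against the standard definition, TPR = TP / (TP + FN). The sketch below is a minimal illustration of that arithmetic, not the authors' evaluation code. The function name and the sample labels are hypothetical, and it assumes each sample carries a boolean ground-truth label (virtual content present) and a boolean model answer.

```python
def true_positive_rate(labels, predictions):
    """Return TP / (TP + FN), computed over the ground-truth positive samples."""
    # True positives: content is present and the model reported it.
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    # False negatives: content is present but the model missed it.
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp + fn == 0:
        raise ValueError("TPR is undefined without positive samples")
    return tp / (tp + fn)


# Hypothetical data: four scenes with virtual content, one without.
labels = [True, True, True, True, False]
predictions = [True, True, True, False, True]
print(true_positive_rate(labels, predictions))  # 0.75
```

The false positive on the last sample does not change the result, because TPR only considers samples whose ground truth is positive.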
Abstract
Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs -- GPT, Gemini, and Claude -- in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for…
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Virtual Reality Applications and Impacts · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Discriminative Fine-Tuning · Layer Normalization · Cosine Annealing · Dense Connections · Softmax · Adam
