Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble

Lin Duan; Yanming Xiu; Maria Gorlatova

arXiv:2501.13964·cs.CV·February 4, 2025

TL;DR

This paper evaluates the ability of state-of-the-art vision-language models to perceive and describe augmented reality scenes, highlighting their strengths in detecting obvious virtual objects and limitations with seamlessly integrated content.

Contribution

It introduces DiverseAR, a novel AR dataset, and systematically assesses VLMs' performance in AR scene analysis, revealing their potential and current limitations.

Findings

01

VLMs achieve up to 93% TPR in perception tasks

02

VLMs describe AR scenes with 71% accuracy

03

Performance is affected by virtual content placement and rendering quality
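For readers unfamiliar with the metric in the first finding, True Positive Rate is conventionally defined as TP / (TP + FN): the share of actual positives that were identified as positive. The sketch below illustrates that standard definition only. The function name and the sample labels are hypothetical and are not taken from the paper or from DiverseAR, and the paper's own evaluation protocol may differ.

```python
def true_positive_rate(labels, predictions):
    """Return TP / (TP + FN), computed over samples whose label is True."""
    # True positives: positive samples the model also flagged as positive.
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    # False negatives: positive samples the model missed.
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp + fn == 0:
        raise ValueError("TPR is undefined when there are no positive samples")
    return tp / (tp + fn)


# Hypothetical data: True means "the image contains virtual content".
labels = [True, True, True, True, False]
predictions = [True, True, True, False, True]

tpr = true_positive_rate(labels, predictions)
print(f"TPR = {tpr:.2f}")
```

Note that the false positive on the last sample does not affect TPR, because the metric only considers samples that are actually positive.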

Abstract

Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs -- GPT, Gemini, and Claude -- in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for…

Code & Models

Repositories

arresearch-1/diversear-dataset (Official)

Taxonomy

Topics: 3D Surveying and Cultural Heritage · Virtual Reality Applications and Impacts · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Discriminative Fine-Tuning · Layer Normalization · Cosine Annealing · Dense Connections · Softmax · Adam