Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

Yanming Xiu; Zhengyuan Jiang; Neil Zhenqiang Gong; Maria Gorlatova

arXiv:2604.05510 · cs.CV · April 14, 2026

TL;DR

This paper introduces ContrAR, a benchmark dataset with real-world AR videos to evaluate vision-language models' robustness against virtual content manipulation and contradiction in augmented reality environments.

Contribution

The work presents a new benchmark, ContrAR, and evaluates 11 vision-language models, highlighting current limitations and challenges in detecting adversarial virtual content in AR.

Findings

01 Current VLMs show a reasonable understanding of contradictory virtual content.

02 Balancing detection accuracy against latency remains challenging.

03 Reasoning about adversarial virtual content in AR still has room for improvement.

Abstract

Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of…
