Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality
Yanming Xiu, Zhengyuan Jiang, Neil Zhenqiang Gong, Maria Gorlatova

TL;DR
This paper introduces ContrAR, a benchmark dataset with real-world AR videos to evaluate vision-language models' robustness against virtual content manipulation and contradiction in augmented reality environments.
Contribution
The work presents a new benchmark, ContrAR, and evaluates 11 vision-language models, highlighting current limitations and challenges in detecting adversarial virtual content in AR.
Findings
Current VLMs show reasonable understanding of contradictory content
Balancing detection accuracy against latency remains challenging
Reasoning about adversarial virtual content in AR leaves room for improvement
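To make the accuracy and latency trade-off in the findings concrete, here is a minimal Python sketch of how both quantities could be measured for a single detector. It is not the ContrAR evaluation protocol, which this summary does not describe. The `evaluate_detector` function, the `(clip, label)` sample format, and the keyword-based toy detector are all hypothetical stand-ins for a real VLM call and real AR video data.

```python
import time
from typing import Callable, Iterable, Tuple


def evaluate_detector(
    detector: Callable[[str], bool],
    samples: Iterable[Tuple[str, bool]],
) -> Tuple[float, float]:
    """Return (accuracy, mean latency in seconds) for a detector.

    Assumption: each sample is (clip, label), where label is True when
    the clip contains contradictory virtual content. Both the sample
    format and this function are hypothetical, not from the paper.
    """
    correct = 0
    total = 0
    latency_sum = 0.0
    for clip, label in samples:
        start = time.perf_counter()
        prediction = detector(clip)  # stand-in for a VLM query
        latency_sum += time.perf_counter() - start
        correct += int(prediction == label)
        total += 1
    if total == 0:
        raise ValueError("samples must not be empty")
    return correct / total, latency_sum / total


# Toy detector: flags a clip description that mentions "contradict".
def toy_detector(clip: str) -> bool:
    return "contradict" in clip


toy_samples = [
    ("sign contradicts the road marking", True),
    ("navigation arrow matches the street", False),
    ("label does not contradict anything", False),
]
accuracy, mean_latency = evaluate_detector(toy_detector, toy_samples)
```

Running the same harness over several models gives one (accuracy, latency) pair per model, which is the kind of comparison the findings allude to. A real setup would also need to account for network and decoding time when the model is served remotely.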
Abstract
Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of…
