WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

Eun Chang; Zhuangqun Huang; Yiwei Liao; Sagar Ravi Bhavsar; Amogh Param; Tammy Stark; Adel Ahmadyan; Xiao Yang; Jiaqi Wang; Ahsan Abdullah; Giang Nguyen; Akil Iyer; David Hall; Elissa Li; Shane Moon; Nicolas Scheffer; Kirmani Ahmed; Babak Damavandi; Rakesh Wanga; Anuj Kumar; Rohit Patel; and Xin Luna Dong

arXiv:2511.22154·cs.AI·December 3, 2025

TL;DR

WearVQA is a new benchmark designed to evaluate the visual question answering capabilities of AI models on wearable devices in real-world, egocentric scenarios with diverse challenges and image qualities.

Contribution

This paper introduces WearVQA, the first benchmark focusing on egocentric, real-world wearable scenarios, with a comprehensive dataset and evaluation framework for VQA tasks.

Findings

1. Open-source models achieve 24-52% accuracy on WearVQA.
2. Performance drops significantly on low-quality images and reasoning tasks.
3. WearVQA presents a challenging benchmark for advancing wearable AI systems.

Abstract

We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multimodal AI assistants on wearable devices such as smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous…
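The abstract describes each item as an image-question-answer triplet tagged with an image domain, a task type, and an image quality issue, which suggests reporting accuracy per category as well as overall. The following is a minimal Python sketch of such a breakdown. The `Example` class, its field names, the category labels, and the `accuracy_by` helper are all hypothetical and are not taken from the released dataset or the authors' evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    # Hypothetical field names; the released dataset may use different ones.
    domain: str
    task_type: str
    quality_issue: str
    correct: bool  # whether a model's answer was judged correct


def accuracy_by(examples, key):
    """Return {category: accuracy}, grouping by the attribute named `key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ex in examples:
        category = getattr(ex, key)
        totals[category] += 1
        hits[category] += int(ex.correct)
    return {category: hits[category] / totals[category] for category in totals}


# Toy records with placeholder labels, for illustration only.
# These are not benchmark data or results.
toy = [
    Example("text-centric", "recognition", "blurry", True),
    Example("text-centric", "reasoning", "blurry", False),
    Example("general", "recognition", "occluded", True),
    Example("general", "reasoning", "poorly lit", False),
]

by_task = accuracy_by(toy, "task_type")
by_quality = accuracy_by(toy, "quality_issue")
```

Grouping by one attribute at a time keeps the helper small; the same function can be called with `"domain"` to slice by image domain.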

Code & Models

Datasets

kaimeta/wearables_benchmarks

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Social Robot Interaction and HRI