WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables
Zhaojiang Lin, Yong Xu, Kai Sun, Jing Zheng, Yin Huang, Surya Teja Appini, Krish Narang, Renjie Tao, Ishan Kapil Jain, Siddhant Arora, Ruizhi Li, Yiteng Huang, Kaushik Patnaik, Wenfang Xu, Suwon Shon, Yue Liu, Ahmed A Aly, Anuj Kumar, Florian Metze, Xin Luna Dong

TL;DR
WearVox introduces a comprehensive egocentric audio benchmark for wearable voice assistants, capturing real-world challenges like noise, motion, and micro-interactions to evaluate and improve model robustness.
Contribution
This paper presents WearVox, the first benchmark specifically designed for realistic wearable scenarios, including multi-channel egocentric audio and diverse tasks, filling a gap in existing evaluation methods.
Findings
Most SLLMs achieve 29-59% accuracy on WearVox
Multi-channel audio improves robustness against noise
Performance drops significantly in outdoor noisy environments
Abstract
Wearable devices such as AI glasses are transforming voice assistants into always-available, hands-free collaborators that integrate seamlessly with daily life, but they also introduce challenges like egocentric audio affected by motion and noise, rapid micro-interactions, and the need to distinguish device-directed speech from background conversations. Existing benchmarks largely overlook these complexities, focusing instead on clean or generic conversational audio. To bridge this gap, we present WearVox, the first benchmark designed to rigorously evaluate voice assistants in realistic wearable scenarios. WearVox comprises 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks including Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation, spanning a wide range of indoor and outdoor environments and…
Peer Reviews
Decision: ICLR 2026 Poster
- First benchmark aimed squarely at wearables with egocentric, multi-mic audio, diverse indoor/outdoor scenes, and explicit side-talk; prior suites largely miss these factors.
- Evaluates both open and proprietary SLLMs; headline finding: most real-time SLLMs land at ~29–59% on WearVox, highlighting the benchmark's difficulty.
- Five tasks with clean input/output definitions.
- Supports multi-channel processing for testing SLLMs.
- Some reporting is aggregate. More per-environment and per-distance breakdowns (beyond the figures) would make failure modes easier to act on.
- "Thinking" boosts scores but increases TTFT substantially; this deserves heavier emphasis for wearables.
- A careful proofreading pass is needed to improve clarity and fix typos.
- No audio examples are provided to listen to.
1. This paper is a solid contribution to benchmarking real-world AI assistant applications in a wearable setting; such data is very hard to find and is expensive to manually curate, script, and collect. It is a first-of-its-kind dataset for this emerging setting in HCI and human-AI interfaces.
2. The baselining is done for a wide range of commercial models, and is done across settings and tasks that matter for this wearable setting.
3. "Side-Talk Rejection" is arguably the single most important and
1. For the custom-trained models, it is not clear whether there is data leakage. It is also not clear from the paper what data was used to train these models. Presumably the baseline models were also trained with noise augmentation, as is standard with commercial-grade speech models.
2. There doesn't seem to be an easy way to explore the benchmark. Perhaps this is a way to prevent leakage, but it would be great to be able to explore this dataset interactively.
3. It is hard to get a sense of
(1) WearVox represents a significant and original data contribution. It is, to the best of current knowledge, the first large-scale dataset that combines egocentric RGB video, binaural recordings, and contact microphone signals in naturalistic scenarios. This unique combination allows for studying wearable perception in realistic settings, including interactions and self-generated sounds that traditional datasets cannot capture. (2) The dataset’s coverage of multiple downstream multimodal tasks—
(1) The WearVoxNet model, while competent, does not introduce fundamentally new architectures or fusion strategies. It primarily builds upon established audio-visual encoder paradigms, combining features through standard cross-attention or concatenation methods. As a result, its novelty lies more in the dataset than in methodological innovation. (2) The paper does not explore cross-dataset generalization, such as training on WearVox and evaluating on existing benchmarks like Ego4D or AVD. Such e
Taxonomy
Topics: AI in Service Interactions · Speech Recognition and Synthesis · Social Robot Interaction and HRI
