SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
Zhuohang Jiang, Xu Yuan, Haohao Qu, Shanru Lin, Kanglong Liu, Wenqi Fan, Qing Li

TL;DR
This paper introduces SUPERGLASSES, a real-world VQA benchmark for smart glasses, and proposes SUPERLENS, a retrieval-augmented multimodal agent that significantly improves performance.
Contribution
It provides the first comprehensive, real-world VQA dataset for smart glasses and develops SUPERLENS, a novel multimodal agent that outperforms existing models.
Findings
An evaluation of 26 VLMs reveals significant performance gaps.
SUPERLENS achieves state-of-the-art results, surpassing GPT-4o by 2.19%.
The dataset and benchmark are publicly available for further research.
Abstract
The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched…
