Gazeify Then Voiceify: Physical Object Referencing Through Gaze and Voice Interaction with Displayless Smart Glasses

Zheng Zhang; Mengjie Yu; Tianyi Wang; Kashyap Todi; Ajoy Savio Fernandes; Yue Liu; Haijun Xia; Tovi Grossman; Tanya Jonker

arXiv:2601.19281·cs.HC·January 28, 2026

Open Access

TL;DR

This paper presents Gazeify then Voiceify, a multimodal system for referencing physical objects through gaze and voice on displayless smart glasses, so that users can select objects and correct selection errors without any visual feedback.

Contribution

We introduce a multimodal approach that combines gaze and voice for object referencing on displayless smart glasses, implemented by integrating object segmentation and detection with a vision-language model.

Findings

01. Participants achieved correct gaze selection in 53% of the task trials.

02. Voice disambiguation corrected 58% of the remaining errors.

03. Participants rated the system as likable, useful, and easy to use.
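
The two percentages above compose rather than add, so a short calculation may help when reading them together. The sketch below is a back-of-the-envelope estimate, not a figure reported in the paper. It assumes that "remaining errors" means the 47% of trials where gaze selection alone was wrong, and that voice disambiguation was attempted in all of those trials. The function name `overall_success_rate` is hypothetical.

```python
def overall_success_rate(gaze_correct: float, voice_fix: float) -> float:
    """Estimate the fraction of trials that end with the correct object.

    gaze_correct: fraction of trials where gaze selection alone was correct.
    voice_fix:    fraction of the remaining (incorrect) trials that voice
                  disambiguation corrected.
    Assumption: every incorrect gaze trial is followed by a voice attempt.
    """
    remaining_errors = 1.0 - gaze_correct
    return gaze_correct + remaining_errors * voice_fix


# Figures taken from the findings: 53% gaze accuracy, 58% of errors corrected.
rate = overall_success_rate(0.53, 0.58)
print(f"{rate:.1%}")  # prints "80.3%"
```

Under these assumptions, roughly four in five trials would end with the intended object selected. The paper's own overall figure, if it reports one, may differ.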

Abstract

Smart glasses enhance interactions with the environment by using head-mounted cameras to observe the user's viewpoint, but lack the visual feedback used for common interactions. We introduce Gazeify then Voiceify, a multimodal approach allowing object selection via gaze and voice using displayless smart glasses. Users can select a physical object with their gaze, and the system generates a digital mask and a voice description of the object's semantics. Users can further correct errors through free-form conversation. To demonstrate our approach, we develop an interactive system by integrating advanced object segmentation and detection with a vision-language model. User studies reveal that participants achieve correct gaze selection in 53% of the task trials and use voice disambiguation to correct 58% of the remaining errors. Participants also rated the system as likable, useful, and easy…
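
To make the gaze-selection step concrete, the following is a minimal sketch of one way a gaze point could be mapped to a segmentation mask. It is not the authors' implementation: the abstract does not say how the gaze point is matched to a mask, so the "smallest mask covering the gaze point" heuristic, the `Mask` type, and the function names are all assumptions chosen for illustration. Masks are plain nested lists standing in for the output of a segmentation model.

```python
from typing import Optional, Sequence

# Row-major binary mask; True marks a pixel belonging to the object.
Mask = Sequence[Sequence[bool]]


def mask_area(mask: Mask) -> int:
    """Count the object pixels in a mask."""
    return sum(sum(1 for px in row if px) for row in mask)


def select_mask_at_gaze(masks: Sequence[Mask], gaze_x: int, gaze_y: int) -> Optional[int]:
    """Return the index of the smallest mask covering the gaze point, or None.

    Hypothetical heuristic: when masks overlap (a mug on a table), prefer the
    smaller one, on the assumption that it is the more specific object.
    """
    hits = [
        i for i, m in enumerate(masks)
        if 0 <= gaze_y < len(m) and 0 <= gaze_x < len(m[gaze_y]) and m[gaze_y][gaze_x]
    ]
    if not hits:
        return None
    return min(hits, key=lambda i: mask_area(masks[i]))


# Toy 3x4 frame: a large "table" mask and a small "mug" mask on top of it.
table = [[False, False, False, False],
         [True,  True,  True,  True],
         [True,  True,  True,  True]]
mug = [[False, False, False, False],
       [False, True,  False, False],
       [False, False, False, False]]

print(select_mask_at_gaze([table, mug], gaze_x=1, gaze_y=1))  # prints 1 (the mug)
```

A heuristic like this will sometimes pick the wrong object, for example when the user means the table rather than the mug. That kind of ambiguity is what the voice disambiguation step described in the abstract is meant to resolve.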

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Interactive and Immersive Displays · Social Robot Interaction and HRI