Gazeify Then Voiceify: Physical Object Referencing Through Gaze and Voice Interaction with Displayless Smart Glasses
Zheng Zhang, Mengjie Yu, Tianyi Wang, Kashyap Todi, Ajoy Savio Fernandes, Yue Liu, Haijun Xia, Tovi Grossman, Tanya Jonker

TL;DR
This paper presents Gazeify then Voiceify, a novel multimodal system enabling physical object referencing through gaze and voice interactions on displayless smart glasses, enhancing user interaction without visual feedback.
Contribution
We introduce a multimodal approach that combines gaze and voice for object referencing on displayless smart glasses, implemented as an interactive system that integrates object segmentation and detection with a vision-language model.
Findings
Participants achieved correct gaze selection in 53% of task trials
Voice disambiguation corrected 58% of the remaining errors
Participants rated the system as likable, useful, and easy to use
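The first two findings can be read together as a rough end-to-end figure. The sketch below is only an illustration: it assumes the 58% correction rate applies to every trial in which gaze selection failed, and the function name `overall_success_rate` is hypothetical. The combined number is derived here and is not a result reported in this summary.

```python
def overall_success_rate(gaze_correct: float, voice_fix_rate: float) -> float:
    """Fraction of trials that end correct after gaze selection plus voice repair.

    Assumption (not stated in the summary): every trial with a gaze error
    is followed by a voice disambiguation attempt.
    """
    gaze_errors = 1.0 - gaze_correct          # trials where gaze alone failed
    return gaze_correct + gaze_errors * voice_fix_rate


rate = overall_success_rate(0.53, 0.58)
print(f"{rate:.1%}")  # 0.53 + 0.47 * 0.58 = 0.8026 -> prints "80.3%"
```

Under that assumption, roughly four in five trials would end with the intended object selected.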
Abstract
Smart glasses enhance interactions with the environment by using head-mounted cameras to observe the user's viewpoint, but lack the visual feedback used for common interactions. We introduce Gazeify then Voiceify, a multimodal approach allowing object selection via gaze and voice using displayless smart glasses. Users can select a physical object with their gaze, and the system generates a digital mask and a voice description of the object's semantics. Users can further correct errors through free-form conversation. To demonstrate our approach, we develop an interactive system by integrating advanced object segmentation and detection with a vision-language model. User studies reveal that participants achieve correct gaze selection in 53% of the task trials and use voice disambiguation to correct 58% of the remaining errors. Participants also rated the system as likable, useful, and easy…
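The gaze selection step described in the abstract can be pictured as a hit test between a gaze point and a set of segmentation masks. The following is a minimal sketch under stated assumptions, not the authors' implementation: masks are plain binary grids, the gaze point is already mapped to image coordinates, and ties between overlapping masks are broken by picking the smallest one. The name `select_mask_by_gaze` and the tie-breaking rule are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Mask = Sequence[Sequence[bool]]  # row-major binary mask, indexed as mask[y][x]


def select_mask_by_gaze(masks: Sequence[Mask], gaze_xy: Tuple[int, int]) -> Optional[int]:
    """Return the index of the smallest mask covering the gaze point, or None."""
    x, y = gaze_xy
    best_index: Optional[int] = None
    best_area: Optional[int] = None
    for index, mask in enumerate(masks):
        # Skip masks that do not contain the gaze coordinates at all.
        if not (0 <= y < len(mask) and 0 <= x < len(mask[y])):
            continue
        if not mask[y][x]:
            continue
        # Assumed heuristic: prefer the smallest covering mask (a cup over the table).
        area = sum(1 for row in mask for cell in row if cell)
        if best_area is None or area < best_area:
            best_index, best_area = index, area
    return best_index
```

When no mask covers the gaze point the function returns `None`, which is the kind of case the voice disambiguation step described above is meant to recover from.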
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Interactive and Immersive Displays · Social Robot Interaction and HRI
