(Computer) Vision in Action: Comparing Remote Sighted Assistance and a Multimodal Voice Agent in Inspection Sequences
Damien Rudaz, Barbara Nino Carreras, Sara Merlino, Brian L. Due, Barry Brown

TL;DR
This study compares human remote sighted assistance with multimodal voice agents in inspection tasks, revealing that current agents lack key vision-based proactive practices essential for effective collaboration.
Contribution
The paper provides a detailed analysis of the differences between human and AI assistance in visual inspection, highlighting the limitations of current multimodal voice agents.
Findings
Human assistance involves proactive vision-based actions.
Current voice agents lack environmentally occasioned vision actions.
Proactivity in assistance is crucial for effective collaboration.
Abstract
Does human-AI assistance unfold in the same way as human-human assistance? This research explores what can be learned from the expertise of blind individuals and sighted volunteers to inform the design of multimodal voice agents and address the enduring challenge of proactivity. Drawing on granular analysis of two representative fragments from a larger corpus, we contrast the practices co-produced by an experienced human remote sighted assistant and a blind participant, as they collaborate to find a stain on a blanket over the phone, with those achieved when the same participant worked with a multimodal voice agent on the same task, a few moments earlier. This comparison enables us to specify precisely which fundamental proactive practices the agent did not enact in situ. We conclude that, so long as multimodal voice agents cannot produce environmentally occasioned vision-based actions,…
Taxonomy
Topics: AI in Service Interactions · Social Robot Interaction and HRI · Tactile and Sensory Interactions
