Understanding Gesture and Speech Multimodal Interactions for Manipulation Tasks in Augmented Reality Using Unconstrained Elicitation
Adam S. Williams, Francisco R. Ortega

TL;DR
This study investigates how users naturally combine speech and gestures in augmented reality for object manipulation, providing insights into timing, syntax, and interaction patterns to improve multimodal system design.
Contribution
It offers a detailed analysis of unconstrained speech and gesture interactions in AR, including timing windows and common proposal patterns, with practical recommendations for system improvements.
Findings
Gestures commonly precede speech by 81 ms.
The gesture stroke commonly falls within 10 ms of the start of speech, indicating that the two modalities are closely aligned in time.
Variation in hand posture and syntax causes proposal disagreements.
Abstract
This research establishes a better understanding of the syntax choices in speech interactions and of how speech, gesture, and multimodal gesture and speech interactions are produced by users in unconstrained object manipulation environments using augmented reality. The work presents a multimodal elicitation study conducted with 24 participants. The canonical referents for translation, rotation, and scale were used along with some abstract referents (create, destroy, and select). In this study, time windows for gesture and speech multimodal interactions are developed using the start and stop times of gestures and speech as well as the stroke times for gestures. While gestures commonly precede speech by 81 ms, we find that the stroke of the gesture is commonly within 10 ms of the start of speech, indicating that the information content of a gesture and its co-occurring speech are well…
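The timing quantities named in the abstract (gesture onset relative to speech onset, stroke relative to speech onset, and an overall time window) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Interaction` record, its field names, the sign conventions, and the sample timestamps are hypothetical and are not taken from the paper's data or analysis code.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One multimodal proposal; all times in milliseconds (hypothetical schema)."""
    gesture_start_ms: float
    gesture_stroke_ms: float
    gesture_stop_ms: float
    speech_start_ms: float
    speech_stop_ms: float


def gesture_lead_ms(it: Interaction) -> float:
    # Positive when the gesture begins before speech does.
    return it.speech_start_ms - it.gesture_start_ms


def stroke_offset_ms(it: Interaction) -> float:
    # Signed distance from speech onset to the gesture stroke.
    return it.gesture_stroke_ms - it.speech_start_ms


def time_window_ms(it: Interaction) -> tuple[float, float]:
    # Smallest window that covers both the gesture and the speech.
    start = min(it.gesture_start_ms, it.speech_start_ms)
    stop = max(it.gesture_stop_ms, it.speech_stop_ms)
    return (start, stop)


# Made-up timestamps, chosen only to exercise the functions.
sample = Interaction(
    gesture_start_ms=0.0,
    gesture_stroke_ms=120.0,
    gesture_stop_ms=900.0,
    speech_start_ms=100.0,
    speech_stop_ms=700.0,
)

lead = gesture_lead_ms(sample)
offset = stroke_offset_ms(sample)
window = time_window_ms(sample)
print(lead, offset, window)
```

Averaging `gesture_lead_ms` and `stroke_offset_ms` over many recorded proposals would give summary figures of the kind the abstract reports, although the paper's actual aggregation method is not stated in this excerpt.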
