Jointly Optimizing Sensing Pipelines for Multimodal Mixed Reality Interaction
Darshana Rathnayake, Ashen de Silva, Dasun Puwakdandawa, Lakmal Meegahapola, Archan Misra, Indika Perera

TL;DR
This paper introduces a sensor fusion architecture for multimodal mixed reality interaction that dynamically balances model complexity across visual, speech, and gestural inputs to reduce latency and improve accuracy on resource-constrained devices.
Contribution
It presents a reconfigurable, cross-modal sensor fusion system that optimizes model complexity based on context, significantly reducing latency and enhancing comprehension accuracy.
Findings
3-fold reduction in comprehension latency
10-15% increase in accuracy
Model combination performance varies with context
Abstract
Natural human interactions for Mixed Reality Applications are overwhelmingly multimodal: humans communicate intent and instructions via a combination of visual, aural and gestural cues. However, supporting low-latency and accurate comprehension of such multimodal instructions (MMI) on resource-constrained wearable devices remains an open challenge, especially as the state-of-the-art comprehension techniques for each individual modality increasingly utilize complex Deep Neural Network models. We demonstrate the possibility of overcoming the core limitation of the latency-vs.-accuracy tradeoff by exploiting cross-modal dependencies, i.e., by compensating for the inferior performance of one model with the increased accuracy of a more complex model of a different modality. We present a sensor fusion architecture that performs MMI comprehension in a quasi-synchronous fashion, by fusing…
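The cross-modal tradeoff described above can be illustrated with a minimal sketch. It is not the paper's method: the model names, latency and accuracy figures, the additive latency model, and the product-of-accuracies score are all assumptions made for illustration. The sketch enumerates one model per modality and keeps the combination with the best assumed joint score that fits a latency budget.

```python
from itertools import product
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    latency_ms: int    # illustrative value, not a measurement
    accuracy: float    # illustrative value, not taken from the paper


# Hypothetical candidates per modality; every number here is made up.
CANDIDATES = {
    "visual": [Model("vis-small", 40, 0.80), Model("vis-large", 200, 0.95)],
    "speech": [Model("asr-small", 30, 0.85), Model("asr-large", 150, 0.97)],
    "gesture": [Model("ges-small", 20, 0.82), Model("ges-large", 100, 0.93)],
}


def select_combination(candidates, budget_ms):
    """Return (score, latency_ms, combo) for the best combination within
    the latency budget, or None if no combination fits.

    Assumptions: models run one after another, so latencies add up, and
    the joint score is the product of per-modality accuracies.
    """
    best = None
    for combo in product(*candidates.values()):
        latency = sum(m.latency_ms for m in combo)
        if latency > budget_ms:
            continue  # over budget, skip this combination
        score = 1.0
        for m in combo:
            score *= m.accuracy
        if best is None or score > best[0]:
            best = (score, latency, combo)
    return best


if __name__ == "__main__":
    result = select_combination(CANDIDATES, budget_ms=250)
    if result is not None:
        score, latency, combo = result
        print([m.name for m in combo], latency, round(score, 3))
```

Under these made-up numbers, a tight budget forces a mix of small and large models, so a heavier model in one modality offsets lighter ones elsewhere. Because the summary notes that the performance of a model combination varies with context, a fuller version would presumably use a separate accuracy table per context rather than a single fixed one.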
