Efficient Multimodal Neural Networks for Trigger-less Voice Assistants
Sai Srujana Buddi, Utkarsh Oggy Sarawgi, Tashweena Heeramun, Karan Sawnhey, Ed Yanosik, Saravana Rathinam, Saurabh Adya

TL;DR
This paper introduces a neural network-based multimodal fusion system for trigger-less voice assistants on smartwatches, improving accuracy, adaptability, and deployment efficiency in diverse environments.
Contribution
It presents a neural network approach to audio-gesture fusion that is more accurate and more scalable than heuristic methods while remaining deployable on low-power devices.
Findings
Enhanced temporal understanding of audio-gesture data
High accuracy in diverse environments
Lightweight model suitable for smartwatches
Abstract
The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interactions. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to VAs without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods have limitations, including limited adaptability, scalability, and induced human biases. In this work, we propose a neural network based audio-gesture multimodal fusion system that (1) Better understands temporal correlation between audio and gesture data, leading to precise invocations (2) Generalizes to a wide range of environments and scenarios (3) Is lightweight and deployable on low-power devices, such…
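To make the idea of audio-gesture fusion concrete, the following is a minimal sketch of one possible late-fusion scheme, not the architecture described in the paper. It assumes each modality has already been reduced to one score per time frame, and it combines the two score windows with per-frame weights followed by a sigmoid. The function names (`fuse_window`, `should_invoke`), the weights, the bias, and the threshold are all hypothetical; in a real system the weights would be learned from data rather than set by hand.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_window(
    audio_scores: Sequence[float],
    gesture_scores: Sequence[float],
    w_audio: Sequence[float],
    w_gesture: Sequence[float],
    bias: float,
) -> float:
    """Hypothetical late fusion over a fixed window of frames.

    Each frame contributes a weighted audio score and a weighted gesture
    score. Per-frame weights let the position of a frame in the window
    matter, which is one simple way to encode timing between modalities.
    Returns an invocation probability in (0, 1).
    """
    lengths = {len(audio_scores), len(gesture_scores), len(w_audio), len(w_gesture)}
    if len(lengths) != 1:
        raise ValueError("all score and weight sequences must have the same length")

    z = bias
    for a, g, wa, wg in zip(audio_scores, gesture_scores, w_audio, w_gesture):
        z += wa * a + wg * g
    return sigmoid(z)


def should_invoke(probability: float, threshold: float = 0.5) -> bool:
    """Assumed decision rule: invoke the assistant above a fixed threshold."""
    return probability >= threshold


# Illustrative values only (hand-picked, not from the paper).
weights = [1.0, 1.0, 1.0]
raised_and_speaking = fuse_window([0.9, 0.9, 0.9], [0.9, 0.8, 0.7], weights, weights, bias=-3.0)
idle = fuse_window([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], weights, weights, bias=-3.0)
```

Unlike a hand-engineered Finite State Machine, the decision here is driven entirely by numeric parameters, so the same code could in principle be adapted to new environments by refitting the weights rather than rewriting the rules.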
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Music and Audio Processing
