Efficient Multimodal Neural Networks for Trigger-less Voice Assistants

Sai Srujana Buddi; Utkarsh Oggy Sarawgi; Tashweena Heeramun; Karan Sawnhey; Ed Yanosik; Saravana Rathinam; Saurabh Adya

arXiv:2305.12063·cs.LG·May 23, 2023·2 cites

Open Access

TL;DR

This paper introduces a neural network-based multimodal fusion system for trigger-less voice assistants on smartwatches, improving accuracy, adaptability, and deployment efficiency in diverse environments.

Contribution

It presents a novel neural network approach to audio-gesture fusion that is more accurate and more scalable than heuristic methods and can be deployed on low-power devices.

Findings

01 Enhanced temporal understanding of audio-gesture data

02 High accuracy in diverse environments

03 Lightweight model suitable for smartwatches

Abstract

The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interactions. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to VAs without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods have limitations, including limited adaptability, scalability, and induced human biases. In this work, we propose a neural network based audio-gesture multimodal fusion system that (1) Better understands temporal correlation between audio and gesture data, leading to precise invocations (2) Generalizes to a wide range of environments and scenarios (3) Is lightweight and deployable on low-power devices, such…
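
To make the baseline in the abstract concrete, here is a minimal sketch of the kind of hand-engineered Finite State Machine that fuses gesture and audio scores. It is not the authors' system or any shipping RTS implementation: the state names, the per-frame score inputs, the thresholds, and the timeout are all hypothetical and chosen only for illustration.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()      # watch is down, nothing pending
    RAISED = auto()    # raise gesture seen, waiting for speech
    INVOKED = auto()   # assistant would be invoked


# Hypothetical hand-tuned values; not taken from the paper.
GESTURE_THRESHOLD = 0.7
SPEECH_THRESHOLD = 0.6
MAX_WAIT_FRAMES = 3


def run_fsm(gesture_scores, speech_scores):
    """Return the state after each frame of paired gesture/speech scores."""
    state = State.IDLE
    waited = 0
    trace = []
    for g, s in zip(gesture_scores, speech_scores):
        if state is State.IDLE:
            # Only a confident raise gesture opens the listening window.
            if g >= GESTURE_THRESHOLD:
                state = State.RAISED
                waited = 0
        elif state is State.RAISED:
            if s >= SPEECH_THRESHOLD:
                state = State.INVOKED
            else:
                # Give up if speech does not arrive within the window.
                waited += 1
                if waited > MAX_WAIT_FRAMES:
                    state = State.IDLE
        # INVOKED is terminal in this sketch.
        trace.append(state)
    return trace
```

Every threshold and transition here has to be tuned by hand, which is one way to read the adaptability and scalability limits the abstract attributes to heuristic fusion; the paper proposes replacing this kind of rule set with a learned model.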

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Music and Audio Processing