Multichannel Voice Trigger Detection Based on Transform-average-concatenate

Takuya Higuchi; Avamarie Brueggeman; Masood Delfarah; Stephen Shum

arXiv:2309.16036·eess.AS·February 15, 2024

TL;DR

This paper introduces a multichannel voice trigger detection system using a transform-average-concatenate (TAC) block that leverages all available channels, improving detection accuracy over traditional single-channel methods.

Contribution

The work proposes a novel multichannel acoustic model with a modified TAC block that utilizes all channels, enhancing voice trigger detection performance in multi-speaker scenarios.

Findings

01

Achieves up to 30% reduction in false rejection rate.

02

Effectively utilizes multichannel information to improve voice trigger (VT) detection.

03

Demonstrates superiority over baseline channel selection methods.

Abstract

Voice triggering (VT) enables users to activate their devices by just speaking a trigger phrase. A front-end system is typically used to perform speech enhancement and/or separation, and produces multiple enhanced and/or separated signals. Since conventional VT systems take only single-channel audio as input, channel selection is performed. A drawback of this approach is that unselected channels are discarded, even if the discarded channels could contain useful information for VT. In this work, we propose multichannel acoustic models for VT, where the multichannel output from the front-end is fed directly into a VT model. We adopt a transform-average-concatenate (TAC) block and modify the TAC block by incorporating the channel from the conventional channel selection so that the model can attend to a target speaker when multiple speakers are present. The proposed approach achieves up to…
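
To make the transform-average-concatenate idea concrete, here is a minimal NumPy sketch of one TAC-style block. It is an illustration under stated assumptions, not the authors' implementation: the layer sizes, the ReLU activations, the residual connection, and in particular the way the selected channel is injected (concatenating its transformed features to every channel) are guesses at one plausible reading of the abstract. The names `init_params`, `tac_block`, and `selected` are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def init_params(rng, feat_dim, hidden_dim, use_selected):
    """Random weights for one block (hypothetical sizes, for illustration only)."""
    def lin(d_in, d_out):
        return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in), np.zeros(d_out)

    # The concatenated vector has 2 parts (own + average), plus 1 if the
    # selected channel is also appended (assumed modification).
    n_parts = 3 if use_selected else 2
    return {
        "transform": lin(feat_dim, hidden_dim),
        "average": lin(hidden_dim, hidden_dim),
        "concat": lin(n_parts * hidden_dim, feat_dim),
    }


def linear(x, params):
    w, b = params
    return x @ w + b


def tac_block(x, params, selected=None):
    """x has shape (channels, frames, features); the output has the same shape."""
    # Transform: the same layer is applied to every channel.
    h = relu(linear(x, params["transform"]))               # (C, T, H)
    # Average: pool over the channel axis, then transform the pooled features.
    avg = relu(linear(h.mean(axis=0), params["average"]))  # (T, H)
    # Concatenate: each channel sees its own features plus the channel average.
    parts = [h, np.broadcast_to(avg, h.shape)]
    if selected is not None:
        # Assumption: also append the features of the channel picked by the
        # conventional channel selection, so every channel can refer to it.
        parts.append(np.broadcast_to(h[selected], h.shape))
    cat = np.concatenate(parts, axis=-1)
    # Residual connection (an assumption of this sketch).
    return x + relu(linear(cat, params["concat"]))


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10, 8))  # 4 front-end channels, 10 frames, 8 features
params = init_params(rng, feat_dim=8, hidden_dim=16, use_selected=True)
y = tac_block(x, params, selected=1)
print(y.shape)
```

Because the transform is shared and the pooling is a mean over channels, the block without `selected` does not depend on channel order: permuting the input channels permutes the outputs the same way. Passing `selected` breaks that symmetry on purpose, which is how this sketch lets the model favor the channel chosen by channel selection.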

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Blind Source Separation Techniques