A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning

Hengshun Zhou; Jun Du; Chao-Han Huck Yang; Shifu Xiong; Chin-Hui Lee

arXiv:2202.08509·cs.SD·February 18, 2022

TL;DR

This paper proposes an audio-visual wake word spotting system for TV applications that uses visual lip information to improve performance in noisy environments and neural network pruning to reduce model complexity.

Contribution

It introduces LTH-IF, a neural network pruning method that applies the lottery ticket hypothesis in an iterative fine-tuning manner, to build compact single-modal and multi-modal wake word spotting systems.
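
The prune-then-fine-tune loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes magnitude-based pruning (the usual choice in lottery ticket work) and assumes that each round continues from the current weights rather than rewinding to the initial ones. The page does not spell out either detail. The names `prune_smallest`, `iterative_prune`, and `toy_fine_tune` are hypothetical, and `toy_fine_tune` is only a stand-in for real training.

```python
from typing import Callable, List, Tuple


def prune_smallest(weights: List[float], mask: List[int], fraction: float) -> List[int]:
    """Return a new mask with the smallest-magnitude `fraction` of surviving weights removed."""
    alive = [i for i, m in enumerate(mask) if m == 1]
    n_prune = int(len(alive) * fraction)
    # Sort surviving indices by absolute weight, smallest first.
    alive.sort(key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in alive[:n_prune]:
        new_mask[i] = 0
    return new_mask


def iterative_prune(
    weights: List[float],
    fine_tune: Callable[[List[float], List[int]], List[float]],
    rounds: int,
    fraction: float,
) -> Tuple[List[float], List[int]]:
    """Alternate pruning and fine-tuning for a fixed number of rounds."""
    mask = [1] * len(weights)
    for _ in range(rounds):
        mask = prune_smallest(weights, mask, fraction)
        # Zero out the pruned weights before handing the model back to training.
        weights = [w * m for w, m in zip(weights, mask)]
        # Assumption: fine-tuning resumes from the current weights (no rewinding).
        weights = fine_tune(weights, mask)
    return weights, mask


def toy_fine_tune(weights: List[float], mask: List[int]) -> List[float]:
    """Hypothetical stand-in for training: nudges surviving weights, keeps pruned ones at zero."""
    return [(w * 1.01) * m for w, m in zip(weights, mask)]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.08]
final_weights, final_mask = iterative_prune(weights, toy_fine_tune, rounds=2, fraction=0.5)
print(final_mask)
```

Each round here removes half of the weights that are still alive, so two rounds leave a quarter of the original parameters. In a real system the mask would be applied per layer and `fine_tune` would run actual training epochs.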

Findings

01. Audio-visual system outperforms single-modality systems in noisy conditions.

02. LTH-IF pruning significantly reduces model size and computation without performance loss.

03. The approach is effective for real-world TV wake-up scenarios.

Abstract

Audio-only wake word spotting (WWS) is challenging under noisy conditions due to environmental interference in signal transmission. In this paper, we investigate the design of a compact audio-visual WWS system that uses visual information to alleviate the degradation. Specifically, to use visual information, we first encode the detected lips into fixed-size vectors with MobileNet, then concatenate them with acoustic features and feed the result to a fusion network for WWS. However, an audio-visual model based on neural networks has a large footprint and high computational complexity. To meet the application requirements, we apply a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF) to the single-modal and multi-modal models, respectively. Tested on our in-house corpus for audio-visual WWS in a home TV scene, the…
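
The concatenation step mentioned in the abstract can be illustrated with a small sketch. It assumes the lip embeddings and acoustic features have already been aligned to the same number of frames, which the abstract does not state, and the feature dimensions are made up for illustration. `fuse_frames` is a hypothetical helper; the MobileNet encoder and the fusion network themselves are not modeled here.

```python
from typing import List


def fuse_frames(audio_feats: List[List[float]], lip_embeds: List[List[float]]) -> List[List[float]]:
    """Concatenate per-frame acoustic features with per-frame lip embeddings.

    Assumption: both streams have already been resampled to the same frame count.
    """
    if len(audio_feats) != len(lip_embeds):
        raise ValueError("audio and visual streams must have the same number of frames")
    # Each fused frame is the acoustic vector followed by the visual vector.
    return [a + v for a, v in zip(audio_feats, lip_embeds)]


# Two frames, 3-dim acoustic features and 2-dim lip embeddings (illustrative sizes only).
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
lips = [[1.0, 2.0], [3.0, 4.0]]
fused = fuse_frames(audio, lips)
print(len(fused), len(fused[0]))
```

The fused vectors would then be the input to the fusion network, so its input width is the sum of the acoustic and visual dimensions.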

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization

Methods: Pruning