Audio-Visual Wake Word Spotting System For MISP Challenge 2021

Yanguang Xu; Jianwei Sun; Yang Han; Shuaijiang Zhao; Chaoyang Mei; Tingwei Guo; Shuran Zhou; Chuandong Xie; Wei Zou; Xiangang Li

arXiv:2204.08686 · cs.SD · April 21, 2022

TL;DR

This paper introduces a multimodal system combining audio enhancement, data augmentation, and visual cues with attention-based fusion to improve far-field wake word detection robustness in challenging environments.

Contribution

The system integrates audio enhancement, visual features, and attention-based fusion for robust wake word spotting in far-field scenarios.

Findings

01. Achieved a final score of 0.091 in the MISP Challenge 2021.

02. Fine-tuning the model with focal loss significantly improved performance.

03. Multimodal fusion of audio and visual data improves robustness.
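The focal loss mentioned in the findings can be illustrated with a minimal sketch. This page does not give the authors' exact formulation or hyperparameters, so the binary form below, the function names, and the defaults `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration only. The idea is that the factor `(1 - p_t) ** gamma` shrinks the loss on examples the model already classifies confidently, so fine-tuning concentrates on hard examples.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for a single binary prediction.

    p: predicted probability of the positive (wake word) class.
    y: ground-truth label, 1 or 0.
    gamma, alpha: illustrative defaults, not values from the paper.
    """
    p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Average focal loss over a batch of predictions."""
    losses = [binary_focal_loss(p, y, gamma, alpha) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # A confident correct prediction contributes far less than an uncertain one.
    print(binary_focal_loss(0.95, 1), binary_focal_loss(0.55, 1))
```

With `gamma=0` and `alpha=0.5` this reduces to half of the ordinary binary cross-entropy, which is a convenient sanity check when swapping it into an existing training loop.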

Abstract

This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, we first apply speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to the multi-microphone conversational audio. Second, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video information, the provided region of interest (ROI) is used to obtain the visual representation. A multi-layer CNN is then proposed to learn audio and visual representations, and these representations are fed into our two-branch attention-based network, which can be employed for fusion, such as…
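The abstract is truncated before it describes the fusion in detail, so the following is only a generic sketch of one simple form of attention-based fusion, not the authors' two-branch network. It assumes each modality has already been encoded into a fixed-size embedding of the same dimension, and that a query vector (learned in a real model, passed in by hand here) scores each embedding. The names `attention_fuse`, `audio_emb`, `video_emb`, and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_fuse(audio_emb, video_emb, query):
    """Fuse two same-sized embeddings with softmax attention weights.

    Each modality gets a scaled dot-product score against the query;
    the fused vector is the weighted sum of the two embeddings.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, audio_emb) / scale, dot(query, video_emb) / scale]
    w_audio, w_video = softmax(scores)
    fused = [w_audio * a + w_video * v for a, v in zip(audio_emb, video_emb)]
    return fused, (w_audio, w_video)


if __name__ == "__main__":
    fused, weights = attention_fuse([1.0, 0.0], [0.0, 1.0], [4.0, 0.0])
    print(fused, weights)
```

In this toy setup, a query aligned with the audio embedding pushes most of the weight onto audio, which is the behaviour one would want when the video stream is uninformative (and the reverse in heavy acoustic noise).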

Code & Models

Repositories

athena-team/athena · tf · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques