3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

László Tóth; Amin Honarmandi Shandiz

arXiv:2104.11532·cs.SD·April 26, 2021


TL;DR

This paper explores the use of 3D convolutional neural networks for ultrasound-based silent speech interfaces, demonstrating that 3D CNNs outperform traditional CNN+LSTM models in capturing tongue movement sequences.

Contribution

It introduces a novel application of 3D CNNs for silent speech interfaces, showing their effectiveness over CNN+LSTM architectures for processing ultrasound video sequences.

Findings

01  3D CNNs outperform CNN+LSTM models in speech reconstruction accuracy.

02  Decomposed spatial and temporal convolutions are effective for ultrasound video analysis.

03  3D CNNs are a promising alternative for silent speech interface systems.

Abstract

Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. An efficient solution requires methods that do not simply process single images, but can extract the tongue movement information from a sequence of video frames. One option is to combine recurrent neural structures, such as the long short-term memory network (LSTM), with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which has recently proved very successful in video action recognition. We find experimentally that…
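The decomposed form mentioned in the abstract can be illustrated with a minimal pure-Python sketch: a 2D spatial convolution is applied to each frame, followed by a 1D convolution along the time axis. This is not the authors' implementation. It assumes a single channel, no padding, no nonlinearity between the two steps, and hand-picked kernels; the function names and the channel sizes passed to `param_counts` are hypothetical and chosen only for illustration.

```python
def conv2d_valid(frame, kernel):
    """Valid 2D cross-correlation of one frame (list of rows) with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(frame) - kh + 1
    out_w = len(frame[0]) - kw + 1
    return [
        [
            sum(frame[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv1d_time_valid(frames, kernel_t):
    """Valid 1D cross-correlation along the time axis, applied per pixel."""
    kt = len(kernel_t)
    out_t = len(frames) - kt + 1
    h, w = len(frames[0]), len(frames[0][0])
    return [
        [
            [sum(frames[t + c][i][j] * kernel_t[c] for c in range(kt))
             for j in range(w)]
            for i in range(h)
        ]
        for t in range(out_t)
    ]


def decomposed_conv3d(video, kernel_s, kernel_t):
    """Spatial step on every frame, then temporal step across frames."""
    spatial = [conv2d_valid(frame, kernel_s) for frame in video]
    return conv1d_time_valid(spatial, kernel_t)


def param_counts(c_in, c_out, c_mid, kt, kh, kw):
    """Weight counts (biases ignored): full 3D kernel vs. the decomposed pair."""
    full = c_in * c_out * kt * kh * kw
    decomposed = c_in * c_mid * kh * kw + c_mid * c_out * kt
    return full, decomposed


# Toy "video": 5 frames of 4x4 pixels, each frame filled with its frame index.
video = [[[t] * 4 for _ in range(4)] for t in range(5)]
kernel_s = [[1] * 3 for _ in range(3)]   # 3x3 spatial box filter
kernel_t = [1, 0, -1]                    # temporal difference over 3 frames

out = decomposed_conv3d(video, kernel_s, kernel_t)
print(len(out), len(out[0]), len(out[0][0]))  # output size: time, height, width
print(param_counts(16, 16, 16, 3, 3, 3))      # hypothetical channel sizes
```

In this toy setting the output has 3 time steps of 2x2 values. For the hypothetical channel sizes used here, the decomposed pair needs fewer weights than a full 3x3x3 kernel, although the ratio depends on the intermediate channel count.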


Taxonomy

Methods: Memory Network