3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces
László Tóth, Amin Honarmandi Shandiz

TL;DR
This paper explores the use of 3D convolutional neural networks for ultrasound-based silent speech interfaces, demonstrating that 3D CNNs outperform traditional CNN+LSTM models in capturing tongue movement sequences.
Contribution
It introduces a novel application of 3D CNNs for silent speech interfaces, showing their effectiveness over CNN+LSTM architectures for processing ultrasound video sequences.
Findings
3D CNNs outperform CNN+LSTM models in speech reconstruction accuracy.
Decomposed spatial and temporal convolutions are effective for ultrasound video analysis.
3D CNNs are a promising alternative for silent speech interface systems.
Abstract
Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. The efficient solution requires methods that do not simply process single images, but are able to extract the tongue movement information from a sequence of video frames. One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. We find experimentally that…
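The decomposed form mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' network: it assumes a single channel, no nonlinearity, and hypothetical frame and kernel sizes, and every function name is made up for illustration. It shows only the core idea, that a 2D convolution over each frame followed by a 1D convolution along time behaves like a 3D convolution whose kernel factors into a spatial part and a temporal part.

```python
import numpy as np

def conv2d_valid(frame, kernel):
    """'Valid' 2D cross-correlation of one frame with one spatial kernel."""
    kh, kw = kernel.shape
    out_h = frame.shape[0] - kh + 1
    out_w = frame.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * frame[i:i + out_h, j:j + out_w]
    return out

def spatial_then_temporal(video, spatial_kernel, temporal_kernel):
    """Decomposed convolution: 2D over each frame, then 1D along time."""
    # Step 1: the spatial kernel is applied independently to every frame.
    spatial = np.stack([conv2d_valid(f, spatial_kernel) for f in video])
    # Step 2: the temporal kernel mixes neighbouring frames at each pixel.
    kt = len(temporal_kernel)
    out_t = spatial.shape[0] - kt + 1
    out = np.zeros((out_t,) + spatial.shape[1:])
    for t in range(kt):
        out += temporal_kernel[t] * spatial[t:t + out_t]
    return out

def full_conv3d_valid(video, kernel3d):
    """Reference 'valid' 3D cross-correlation over (time, height, width)."""
    kt, kh, kw = kernel3d.shape
    out_t = video.shape[0] - kt + 1
    out_h = video.shape[1] - kh + 1
    out_w = video.shape[2] - kw + 1
    out = np.zeros((out_t, out_h, out_w))
    for t in range(kt):
        for i in range(kh):
            for j in range(kw):
                out += kernel3d[t, i, j] * video[t:t + out_t,
                                                 i:i + out_h,
                                                 j:j + out_w]
    return out

rng = np.random.default_rng(0)
video = rng.standard_normal((5, 16, 16))  # 5 frames of 16x16 (assumed size)
ks = rng.standard_normal((3, 3))          # spatial kernel (assumed 3x3)
kt = rng.standard_normal(3)               # temporal kernel (assumed length 3)

decomposed = spatial_then_temporal(video, ks, kt)
# The equivalent 3D kernel is the outer product of the two smaller kernels.
kernel3d = kt[:, None, None] * ks[None, :, :]
full = full_conv3d_valid(video, kernel3d)
```

Here `decomposed` and `full` agree up to floating-point error, because the 3D kernel was built as an outer product. A trained network would use many channels and typically a nonlinearity between the two steps, so it is not restricted to separable kernels; that part is left out of this sketch.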
Taxonomy
Methods: Memory Network
