A comparative study of estimating articulatory movements from phoneme sequences and acoustic features
Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

TL;DR
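This study compares three ways of estimating speech articulator movements: from the acoustic signal, from the phoneme sequence, and from the phoneme sequence with timing information. Phoneme sequences alone predict articulatory motion almost as well as acoustic features (average correlation coefficient 0.81 versus 0.85), and combining acoustic and phoneme inputs raises the correlation to 0.88.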
Contribution
The paper demonstrates that phoneme sequences alone, without any acoustic input, can estimate articulatory movements effectively, and that combining acoustic and phoneme inputs improves prediction accuracy further.
Findings
An attention network driven by phoneme sequences alone achieves an average correlation coefficient of 0.81.
Estimation from the acoustic signal yields an average correlation coefficient of 0.85.
Combining acoustic and phoneme inputs raises the correlation coefficient to 0.88.
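The figures above are average correlation coefficients between measured and estimated articulatory trajectories. A minimal sketch of how such a score could be computed is given below. It assumes the Pearson correlation, computed per articulator channel and then averaged; the summary does not state the exact variant or averaging scheme, and the function name `average_correlation`, the array shapes, and the 12-channel synthetic data are hypothetical and for illustration only.

```python
import numpy as np


def average_correlation(true_traj, pred_traj):
    """Per-channel Pearson correlation and its mean across channels.

    Both inputs have shape (num_frames, num_channels): one column per
    articulator channel, one row per time frame. (Assumed layout.)
    """
    true_traj = np.asarray(true_traj, dtype=float)
    pred_traj = np.asarray(pred_traj, dtype=float)
    if true_traj.shape != pred_traj.shape:
        raise ValueError("true and predicted trajectories must share a shape")

    # Remove the temporal mean of each channel.
    t = true_traj - true_traj.mean(axis=0)
    p = pred_traj - pred_traj.mean(axis=0)

    # Pearson correlation per channel: covariance over the product of
    # standard deviations (constant channels would divide by zero).
    numerator = (t * p).sum(axis=0)
    denominator = np.sqrt((t ** 2).sum(axis=0) * (p ** 2).sum(axis=0))
    per_channel = numerator / denominator

    return per_channel, float(per_channel.mean())


if __name__ == "__main__":
    # Synthetic stand-in data, NOT the paper's recordings.
    rng = np.random.default_rng(0)
    measured = rng.standard_normal((500, 12))
    estimated = measured + 0.5 * rng.standard_normal((500, 12))
    per_channel, mean_cc = average_correlation(measured, estimated)
    print("per-channel:", np.round(per_channel, 3))
    print("average:", round(mean_cc, 3))
```

Averaging per-channel scores, rather than correlating the flattened arrays, keeps channels with a large movement range from dominating the result; whether the study averages in exactly this way is an assumption.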
Abstract
Unlike phoneme sequences, movements of speech articulators (lips, tongue, jaw, velum) and the resultant acoustic signal are known not only to encode the linguistic message but also to carry para-linguistic information. While several works exist for estimating articulatory movement from acoustic signals, little is known about the extent to which articulatory movements can be predicted from linguistic information alone, i.e., the phoneme sequence. In this work, we estimate articulatory movements from three different input representations: R1) acoustic signal, R2) phoneme sequence, R3) phoneme sequence with timing information. While an attention network is used for estimating articulatory movement in the case of R2, a BLSTM network is used for R1 and R3. Experiments with ten subjects' acoustic-articulatory data reveal that the estimation techniques achieve an average correlation coefficient of 0.85, 0.81, and…
