Automatic audiovisual synchronisation for ultrasound tongue imaging
Aciel Eshky, Joanne Cleland, Manuel Sam Ribeiro, Eleanor Sugden, Korin Richmond, Steve Renals

TL;DR
This paper presents a self-supervised neural network approach for automatically synchronizing ultrasound tongue imaging and speech audio after recording. It achieves over 92.4% accuracy on in-domain data, and users preferred its output over hardware synchronization 79.3% of the time.
Contribution
The study introduces a neural network-based method for post-hoc audiovisual synchronization that works across diverse domains and in cases where hardware synchronization is unreliable.
Findings
Achieved >92.4% accuracy on in-domain data
Users preferred model output over hardware synchronization 79.3% of the time
Demonstrated generalization to new clinical datasets
Abstract
Ultrasound tongue imaging is used to visualise the intra-oral articulators during speech production. It is utilised in a range of applications, including speech and language therapy and phonetics research. Ultrasound and speech audio are recorded simultaneously, and in order to correctly use this data, the two modalities should be correctly synchronised. Synchronisation is achieved using specialised hardware at recording time, but this approach can fail in practice resulting in data of limited usability. In this paper, we address the problem of automatically synchronising ultrasound and audio after data collection. We first investigate the tolerance of expert ultrasound users to synchronisation errors in order to find the thresholds for error detection. We use these thresholds to define accuracy scoring boundaries for evaluating our system. We then describe our approach for automatic…
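The idea of turning tolerance thresholds into accuracy scoring boundaries can be illustrated with a minimal sketch. It counts a predicted audio-ultrasound offset as correct when its error falls inside a lower and an upper bound. The function name `sync_accuracy`, the millisecond units, the asymmetric bounds, and all numeric values below are illustrative assumptions, not the thresholds or the scoring code used in the paper.

```python
from typing import Sequence


def sync_accuracy(
    predicted_ms: Sequence[float],
    true_ms: Sequence[float],
    lower_ms: float,
    upper_ms: float,
) -> float:
    """Fraction of utterances whose offset error lies in [lower_ms, upper_ms].

    The bounds stand in for detectability thresholds; they are
    hypothetical here and must be supplied by the caller.
    """
    if len(predicted_ms) != len(true_ms):
        raise ValueError("predicted and true offsets must have equal length")
    if len(predicted_ms) == 0:
        raise ValueError("at least one utterance is required")
    # Signed error: positive means the predicted offset is too large.
    correct = sum(
        lower_ms <= p - t <= upper_ms for p, t in zip(predicted_ms, true_ms)
    )
    return correct / len(predicted_ms)


# Placeholder offsets and bounds, chosen only to exercise the function.
predicted = [10.0, -80.0, 300.0, 0.0]
truth = [0.0, 0.0, 0.0, 0.0]
accuracy = sync_accuracy(predicted, truth, lower_ms=-100.0, upper_ms=50.0)
print(accuracy)
```

Separate lower and upper bounds are used so that errors in the two directions can be tolerated differently; whether the paper's boundaries are asymmetric is not stated in the text above.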
