TL;DR
This paper introduces a neural network approach to automatically synchronise speech audio with ultrasound tongue videos post-recording, improving efficiency in child speech therapy by reducing manual effort.
Contribution
It presents a novel two-stream neural network model that learns cross-modal embeddings to synchronise audio and ultrasound signals after recording, addressing hardware failure issues.
Findings
Achieves 82.9% correct synchronisation on unseen data
More effective on natural speech variations than directed phonations
Reduces manual synchronisation workload significantly
Abstract
Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the signals post hoc. To address this problem, we employ a two-stream neural network which exploits the correlation between the two modalities to find the offset. We train our model on recordings from 69 speakers, and show that it correctly synchronises 82.9% of test utterances from unseen therapy sessions and unseen speakers, thus considerably reducing the number of utterances to be manually synchronised. An analysis of model performance on the test utterances shows that directed phone…
| Stream | Conv1 | Conv2 | Conv3 | Full4 | Full5 |
|---|---|---|---|---|---|
| Visual | 23x5x5 | 64x5x5 | 128x5x5 | 64 | 64 |
| x2 pool | x2 pool | x2 pool | |||
| Audio | 23x3x3 | 64x3x3 | 128x3x3 | 64 | 64 |
| x2 pool | x2 pool |
| Words (A) | Non-words (B) | Sentence (C) | Articulatory (D) | Conversation (F) | ||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Test Set | N | Acc | N | Acc | N | Acc | N | Acc | N | Acc |
| UXTD | 108 | 88.9% | 22 | 86.4% | 0 | - | 325 | 55.4% | 0 | - |
| UXSSD | 307 | 88.6% | 20 | 65.0% | 58 | 94.8% | 11 | 54.5% | 0 | - |
| UPX | 499 | 92.4% | 16 | 100.0% | 128 | 93.8% | 4 | 100.0% | 4 | 75.0% |
| All | 914 | 90.7% | 58 | 82.8% | 186 | 94.1% | 340 | 55.9% | 4 | 75.0% |
| Utterance Type | N | Acc | Discrepancy | |
| Words (A) | 914 | 90.7% | 1 | 102 ms |
| Non-words (B) | 58 | 82.8% | 2 | 39 ms |
| Sentence (C) | 186 | 94.1% | 16 | 150 ms |
| Articulatory (D) | 340 | 55.9% | 129 | 408 ms |
| Conversation (F) | 4 | 75.0% | 87 | 141 ms |
| All | 1502 | 82.9% | 32 | 223 ms |
| A, B, C and F | 1162 | 91.2% | 3 | 112 ms |
| Test Set | N | Acc | Discrepancy | |
|---|---|---|---|---|
| UXTD, new speakers | 455 | 64.8% | 97 | 357 ms |
| UXSSD, new sessions | 126 | 82.5% | 19 | 160 ms |
| UXSSD, new speaker | 270 | 89.6% | 9 | 135 ms |
| UPX, new sessions | 306 | 91.2% | 3 | 40 ms |
| UPX, new speaker | 345 | 94.2% | 2 | 123 ms |
| All | 1502 | 82.9% | 32 | 223 ms |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Synchronising audio and ultrasound by learning cross-modal embeddings
Abstract
Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the signals post hoc. To address this problem, we employ a two-stream neural network which exploits the correlation between the two modalities to find the offset. We train our model on recordings from 69 speakers, and show that it correctly synchronises 82.9% of test utterances from unseen therapy sessions and unseen speakers, thus considerably reducing the number of utterances to be manually synchronised. An analysis of model performance on the test utterances shows that directed phone articulations are more difficult to automatically synchronise compared to utterances containing natural variation in speech such as words, sentences, or conversations.
Index Terms: Audiovisual synchronisation, speech audio & ultrasound, machine learning, neural-networks, self-supervision.
1 Introduction
Ultrasound tongue imaging (UTI) is a non-invasive way of observing the vocal tract during speech production [1]. Instrumental speech therapy relies on capturing ultrasound videos of the patient’s tongue simultaneously with their speech audio in order to provide a diagnosis, design treatments, and measure therapy progress [2]. The two modalities must be correctly synchronised, with a minimum shift of 45ms if the audio leads and 125ms if the audio lags, based on synchronisation standards for broadcast audiovisual signals [3]. Errors beyond this range can render the data unusable – indeed, synchronisation errors do occur, resulting in significant wasted effort if not corrected. No mechanism currently exists to automatically correct these errors, and although manual synchronisation is possible in the presence of certain audiovisual cues such as stop consonants [4], it is time consuming and tedious.
In this work, we exploit the correlation between the two modalities to synchronise them. We utilise a two-stream neural network architecture for the task [5], using as our only source of supervision pairs of ultrasound and audio segments which have been automatically generated and labelled as positive (correctly synchronised) or negative (randomly desynchronised); a process known as self-supervision [6]. We demonstrate how this approach enables us to correctly synchronise the majority of utterances in our test set, and in particular, those exhibiting natural variation in speech.
Section 2 reviews existing approaches for audiovisual synchronisation, and describes the challenges specifically associated with UTI data, compared with lip videos for which automatic synchronisation has been previously attempted. Section 3 describes our approach. Section 4 describes the data we use, including data preprocessing and positive and negative sample creation using a self-supervision strategy. Section 5 describes our experiments, followed by an analysis of the results. We conclude with a summary and future directions in Section 6111Code available at: https://github.com/aeshky/ultrasync.
2 Background
Ultrasound and audio are recorded using separate components, and hardware synchronisation is achieved by translating information from the visual signal into audio at recording time. Specifically, for every ultrasound frame recorded, the ultrasound beam-forming unit releases a pulse signal, which is translated by an external hardware synchroniser into an audio pulse signal and captured by the sound card [7, 8]. Synchronisation is achieved by aligning the ultrasound frames with the audio pulse signal, which is already time-aligned with the speech audio [9].
Hardware synchronisation can fail for a number of reasons. The synchroniser is an external device which needs to be correctly connected and operated by therapists. Incorrect use can lead to missing the pulse signal, which would cause synchronisation to fail for entire therapy sessions [10]. Furthermore, low-quality sound cards report an approximate, rather than the exact, sample rate which leads to errors in the offset calculation [9]. There is currently no recovery mechanism for when synchronisation fails, and to the best of our knowledge, there has been no prior work on automatically correcting the synchronisation error between ultrasound tongue videos and audio. There is, however, some prior work on synchronising lip movement with audio which we describe next.
2.1 Audiovisual synchronisation for lip videos
Speech audio is generated by articulatory movement and is therefore fundamentally correlated with other manifestations of this movement, such as lip or tongue videos [11]. An alternative to the hardware approach is to exploit this correlation to find the offset. Previous approaches have investigated the effects of using different representations and feature extraction techniques on finding dimensions of high correlation [12, 13, 14]. More recently, neural networks, which learn features directly from input, have been employed for the task. SyncNet [5] uses a two-stream neural network and self-supervision to learn cross-modal embeddings, which are then used to synchronise audio with lip videos. It achieves near perfect accuracy (99) using manual evaluation where lip-sync error is not detectable to a human. It has since been extended to use different sample creation methods for self-supervision [6, 15] and different training objectives [15]. We adopt the original approach [5], as it is both simpler and significantly less expensive to train than the more recent variants.
2.2 Lip videos vs. ultrasound tongue imaging (UTI)
Videos of lip movement can be obtained from various sources including TV, films, and YouTube, and are often cropped to include only the lips [5]. UTI data, on the other hand, is recorded in clinics by trained therapists [16]. An ultrasound probe placed under the chin of the patient captures the midsaggital view of their oral cavity as they speak. UTI data consists of sequences of 2D matrices of raw ultrasound reflection data, which can be interpreted as greyscale images [16]. There are several challenges specifically associated with UTI data compared with lip videos, which can potentially lower the performance of models relative to results reported on lip video data. These include:
Poor image quality: Ultrasound data is noisy, containing arbitrary high-contrast edges, speckle noise, artefacts, and interruptions to the tongue’s surface [1, 17, 18]. The oral cavity is not entirely visible, missing the lips, the palate, and the pharyngeal wall, and visually interpreting the data requires specialised training. In contrast, videos of lip movement are of much higher quality and suffer from none of these issues.
Probe placement variation: Surfaces that are orthogonal to the ultrasound beam image better than those at an angle. Small shifts in probe placement during recording lead to high variation between otherwise similar tongue shapes [1, 19, 18]. In contrast, while the scaling and rotations of lip videos lead to variation, they do not lead to a degradation in image quality.
Inter-speaker variation: Age and physiology affect the quality of ultrasound data, and subjects with smaller vocal tracts and less tissue fat image better [1, 18]. Dryness in the mouth, as a result of nervousness during speech therapy, leads to poor imaging. While inter-speaker variation is expected in lip videos, again, the variation does not lead to quality degradation.
Limited amount of data: Existing UTI datasets are considerably smaller than lip movement datasets. Consider for example VoxCeleb and VoxCeleb2 used to train SyncNet [5, 15], which together contain 1 million utterances from 7,363 identities [20, 21]. In contrast, the UltraSuite repository (used in this work) contains 13,815 spoken utterances from 86 identities.
Uncorrelated segments: Speech therapy data contains interactions between the therapist and patient. The audio therefore contains speech from both speakers, while the ultrasound captures only the patient’s tongue [16]. As a result, parts of the recordings will consist of completely uncorrelated audio and ultrasound. This issue is similar to that of dubbed voices in lip videos [5], but is more prevalent in speech therapy data.
3 Model
We adopt the approach in [5], modifying it to synchronise audio with UTI data. Our model, UltraSync, consists of two streams: the first takes as input a short segment of ultrasound and the second takes as input the corresponding audio. Both inputs are high-dimensional and are of different sizes. The objective is to learn a mapping from the inputs to a pair of low-dimensional vectors of the same length, such that the Euclidean distance between the two vectors is small when they correlate and large otherwise [22, 23]. This model can be viewed as an extension of a siamese neural network [24] but with two asymmetrical streams and no shared parameters. Figure 1 illustrates the main architecture. The visual data (ultrasound) and audio data (MFCC), which have different shapes, are mapped to low dimensional embeddings (visual) and (audio) of the same size:
[TABLE]
The model is trained using a contrastive loss function [22, 23], , which minimises the Euclidean distance between and for positive pairs (), and maximises it for negative pairs (), for a number of training samples :
[TABLE]
Given a pair of ultrasound and audio segments we can calculate the distance between them using our model. To predict the synchronisation offset for an utterance, we consider a discretised set of candidate offsets, calculate the average distance for each across utterance segments, and select the one with the minimum average distance. The candidate set is independent of the model, and is chosen based on task knowledge (Section 5).
4 Data
For our experiments, we select a dataset whose utterances have been correctly synchronised at recording time. This allows us to control how the model is trained and verify its performance using ground truth synchronisation offsets. We use UltraSuite222http://www.ultrax-speech.org/ultrasuite: a repository of ultrasound and acoustic data gathered from child speech therapy sessions [16]. We used all three datasets from the repository: UXTD (recorded with typically developing children), and UXSSD and UPX (recorded with children with speech sound disorders). In total, the dataset contains 13,815 spoken utterances from 86 speakers, corresponding to 35.9 hours of recordings. The utterances have been categorised by the type of task the child was given, and are labelled as: Words (A), Non-words (B), Sentence (C), Articulatory (D), Non-speech (E), or Conversations (F). See [16] for details.
Each utterance consists of 3 files: audio, ultrasound, and parameter. The audio file is a RIFF wave file, sampled at 22.05 KHz, containing the speech of the child and therapist. The ultrasound file consists of a sequence of ultrasound frames capturing the midsagittal view of the child’s tongue. A single ultrasound frame is recorded as a 2D matrix where each column represents the ultrasound reflection intensities along a single scan line. Each ultrasound frame consists of 63 scan lines of 412 data points each, and is sampled at a rate of 121.5 fps. Raw ultrasound frames can be visualised as greyscale images and can thus be interpreted as videos. The parameter file contains the synchronisation offset value (in milliseconds), determined using hardware synchronisation at recording time and confirmed by the therapists to be correct for this dataset.
4.1 Preparing the data
First, we exclude utterances of type “Non-speech” (E) from our training data (and statistics). These are coughs recorded to obtain additional tongue shapes, or swallowing motions recorded to capture a trace of the hard palate. Both of these rarely contain audible content and are therefore not relevant to our task. Next, we apply the offset, which should be positive if the audio leads and negative if the audio lags. In this dataset, the offset is always positive. We apply it by cropping the leading audio and trimming the end of the longer signal to match the duration.
To process the ultrasound more efficiently, we first reduce the frame rate from 121.5 fps to 24.3 fps by retaining 1 out of every 5 frames. We then downsample by a factor of (1, 3), shrinking the frame size from 63x412 to 63x138 using max pixel value. This retains the number of ultrasound vectors (63), but reduces the number of pixels per vector (from 412 to 138).
The final pre-preprocessing step is to remove empty regions. UltraSuite was previously anonymised by zero-ing segments of audio which contained personally identifiable information. As a preprocessing step, we remove the zero regions from audio and corresponding ultrasound. We additionally experimented with removing regions of silence using voice activity detection, but obtained a higher performance by retaining them.
4.2 Creating samples using a self-supervision strategy
To train our model we need positive and negative training pairs. The model ingests short clips from each modality of 200ms long, calculated as , where is the time window, is the number of ultrasound frames per window (5 in our case), and is the ultrasound frame rate of the utterance (24.3 fps). For each recording, we split the ultrasound into non-overlapping windows of 5 frames each. We extract MFCC features (13 cepstral coefficients) from the audio using a window length of 20ms, calculated as , and a step size of 10ms, calculated as . This give us the input sizes shown in Figure 1.
Positive samples are pairs of ultrasound windows and the corresponding MFCC frames. To create negative samples, we randomise pairings of ultrasound windows to MFCC frames within the same utterance, generating as many negative as positive samples to achieve a balanced dataset. We obtain 243,764 samples for UXTD (13.5hrs), 333,526 for UXSSD (18.5hrs), and 572,078 for UPX (31.8 hrs), or a total 1,149,368 samples (63.9hrs) which we divide into training, validation and test sets.
4.3 Dividing samples for training, validation and testing
We aim to test whether our model generalises to data from new speakers, and to data from new sessions recorded with known speakers. To simulate this, we select a group of speakers from each dataset, and hold out all of their data either for validation or for testing. Additionally, we hold out one entire session from each of the remaining speakers, and use the rest of their data for training. We aim to reserve approximately 80% of the created samples for training, 10% for validation, and 10% for testing, and select speakers and sessions on this basis.
Each speaker in UXTD recorded 1 session, but sessions are of different durations. We reserve 45 speakers for training, 5 for validation, and 8 for testing. UXSSD and UPX contain fewer speakers, but each recorded multiple sessions. We hold out 1 speaker for validation and 1 for testing from each of the two datasets. We also hold out a session from the first half of the remaining speakers for validation, and a session from the second half of the remaining speakers for testing333Held out subsets: UXTD speakers 07, 08, 12, 13, 26 for validation and speakers 30, 38, 43, 45, 47, 52, 53, 55 for testing. UXSSD speaker 01 and session ‘Mid’ (for speakers 02-04) for validation, and speaker 07 and session ‘Mid’ (for speakers 05, 06, 08) for testing. UPX speakers 01 and session ‘BL3’ (for speakers 02-10) for validation, and speaker 15 and session ‘BL3’ (for speakers 11-14 and 16-20) for testing.. This selection process results in 909,858 (pooled) samples for training (50.5hrs), 128,414 for validation (7.1hrs) and 111,096 for testing (6.2hrs). From the training set, we create shuffled batches which are balanced in the number of positive and negative samples.
5 Experiments
We select the hyper-parameters of our model empirically by tuning on the validation set (Table 1). Hyper-parameter exploration is guided by [25]. We train our model using the Adam optimiser [26] with a learning rate of 0.001, a batch size of 64 samples, and for 20 epochs. We implement learning rate scheduling which reduces the learning rate by a factor of 0.1 when the validation loss plateaus for 2 epochs.
Upon convergence, the model achieves 0.181 training loss, 0.213 validation loss, and 0.213 test loss. By placing a threshold of 0.5 on predicted distances, the model achieves 72.7% binary classification accuracy on training samples, 65.6% on validation samples, and 65.6% on test samples.
Synchronisation offset prediction: Section 3 described briefly how to use our model to predict the synchronisation offset for test utterances. To obtain a discretised set of offset candidates, we retrieve the true offsets of the training utterances, and find that they fall in the range [0, 1789] ms. We discretise this range taking 45ms steps and rendering 40 candidate values (45ms is the smaller of the absolute values of the detectability thresholds, 125 and 45 ms [3]). We bin the true offsets in the candidate set and discard empty bins, reducing the set from 40 to 24 values. We consider all 24 candidates for each test utterance. We do this by aligning the two signals according to the given candidate, then producing the non-overlapping windows of ultrasound and MFCC pairs, as we did when preparing the data. We then use our model to predict the Euclidean distance for each pair, and average the distances. Finally, we select the offset with the smallest average distance as our prediction.
Evaluation: Because the true offsets are known, we evaluate the performance of our model by computing the discrepancy between the predicted and the true offset for each utterance. If the discrepancy falls within the detectability thresholds [3] (125 45) then the prediction is correct. Random prediction (averaged over 1000 runs) yields 14.6% accuracy with a mean and standard deviation discrepancy of 328 518ms. We achieve 82.9% accuracy with a mean and standard deviation discrepancy of 32 223ms. SyncNet reports 99% accuracy on lip video synchronisation using a manual evaluation where the lip error is not detectable to a human observer [5]. However, we argue that our data is more challenging (Section 2.2).
Analysis: We analyse the performance of our model across different conditions. Table 3 shows the model accuracy broken down by utterance type. The model achieves 91.2% accuracy on utterances containing words, sentences, and conversations, all of which exhibit natural variation in speech. The model is less successful with Articulatory utterances, which contain isolated phones occurring once or repeated (e.g., “sh sh sh”). Such utterances contain subtle tongue movement, making it more challenging to correlate the visual signal with the audio. And indeed, the model finds the correct offset for only 55.9% of Articulatory utterances. A further analysis shows that 84.4% (N90) of stop consonants (e.g., “t”), which are relied upon by therapists as the most salient audiovisual synchronisation cues [4], are correctly synchronised by our model, compared to 48.6% (N140) of vowels, which contain less distinct movement and are also more challenging for therapists to synchronise.
Table 4 shows accuracy broken down by test set. The model performs better on test sets containing entirely new speakers compared with test sets containing new sessions from previously seen speakers. This is contrary to expectation but could be due to the UTI challenges (described in Section 2.2) affecting different subsets to different degrees. Table 4 shows that the model performs considerably worse on UXTD compared to other test sets (64.8% accuracy). However, a further breakdown of the results in Table 2 by test set and utterance type explains this poor performance; the majority of UXTD utterances (71%) are Articulatory utterances which the model struggles to correctly synchronise. In fact, for other utterance types (where there is a large enough sample, such as Words) performance on UXTD is on par with other test sets.
6 Conclusion
We have shown how a two-stream neural network originally designed to synchronise lip videos with audio can be used to synchronise UTI data with audio. Our model exploits the correlation between the modalities to learn cross-model embeddings which are used to find the synchronisation offset. It generalises well to held-out data, allowing us to correctly synchronise the majority of test utterances. The model is best-suited to utterances which contain natural variation in speech and least suited to those containing isolated phones, with the exception of stop consonants. Future directions include integrating the model and synchronisation offset prediction process into speech therapy software [7, 8], and using the learned embeddings for other tasks such as active speaker detection [5].
7 Acknowledgements
Supported by EPSRC Healthcare Partnerships Programme grant number EP/P02338X/1 (Ultrax2020).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] M. Stone, “A guide to analysing tongue motion from ultrasound images,” Clinical Linguistics and Phonetics , vol. 19, no. 6-7, pp. 455–501, 2005.
- 2[2] J. Cleland, J. M. Scobbie, and A. A. Wrench, “Using ultrasound visual biofeedback to treat persistent primary speech sound disorders,” Clinical Linguistics and Phonetics , vol. 29, no. 8-10, pp. 575–597, 2015.
- 3[3] ITU-R, “Recommendation ITU-R BT.1359: Relative timing of sound and vision for broadcasting,” January 1998.
- 4[4] M. Alm and D. Behne, “Audio-visual speech experience with age influences perceived audio-visual asynchrony in speech,” The Journal of the Acoustical Society of America , vol. 134, no. 4, pp. 3001–3010, 2013.
- 5[5] J. S. Chung and A. Zisserman, “Out of time: automated lip sync in the wild,” in Workshop on Multi-view Lip-reading, 13th Asian Conference on Computer Vision (ACCV) , 2016, pp. 251–263.
- 6[6] B. Korbar, D. Tran, and L. Torresani, “Cooperative learning of audio and video models from self-supervised synchronization,” in Advances in Neural Information Processing Systems , 2018, pp. 7763–7774.
- 7[7] A. Wrench, “Sono Speech: Ultrasound application for recording, client assessment and visual feedback,” Software v 217.10, 2018.
- 8[8] ——, “Articulate Assistant Advanced (AAA): Research application for recording and analysing ultrasound, EPG, EMA and other instrumental data,” Software v 217.10, 2018.
