# Synchronising audio and ultrasound by learning cross-modal embeddings

**Authors:** Aciel Eshky, Manuel Sam Ribeiro, Korin Richmond, Steve Renals

arXiv: 1907.00758 · 2019-11-28

## TL;DR

This paper introduces a neural network approach that synchronises speech audio with ultrasound videos of the tongue after recording, reducing the number of child speech therapy utterances that must be synchronised manually.

## Contribution

It presents a two-stream neural network that learns cross-modal embeddings and uses the correlation between audio and ultrasound to estimate their time offset after recording, for cases where hardware synchronisation at recording time has failed.

## Key findings

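- Correctly synchronises 82.9% of test utterances from unseen therapy sessions and unseen speakers, after training on recordings from 69 speakers
- Directed phone articulations are harder to synchronise automatically than utterances with natural variation in speech (words, sentences, or conversations)
- Considerably reduces the number of utterances that need manual synchronisation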

## Abstract

Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the signals post hoc. To address this problem, we employ a two-stream neural network which exploits the correlation between the two modalities to find the offset. We train our model on recordings from 69 speakers, and show that it correctly synchronises 82.9% of test utterances from unseen therapy sessions and unseen speakers, thus considerably reducing the number of utterances to be manually synchronised. An analysis of model performance on the test utterances shows that directed phone articulations are more difficult to automatically synchronise compared to utterances containing natural variation in speech such as words, sentences, or conversations.
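The idea of finding the offset from cross-modal embeddings can be illustrated with a minimal sketch. This summary does not describe how the model selects an offset, so the procedure below is an assumption: it supposes that each stream has already been mapped to per-frame embeddings in a shared space, and it picks the candidate shift with the smallest mean Euclidean distance between paired frames. The function names `mean_distance` and `estimate_offset`, the candidate range, and the toy data are all hypothetical.

```python
import math
import random


def mean_distance(audio_emb, video_emb, shift):
    """Mean Euclidean distance between audio frame i + shift and video frame i.

    Only frame pairs that exist in both sequences are counted.
    Returns infinity when the shift leaves no overlapping frames.
    """
    total, count = 0.0, 0
    for i, video_frame in enumerate(video_emb):
        j = i + shift
        if 0 <= j < len(audio_emb):
            total += math.dist(audio_emb[j], video_frame)
            count += 1
    return total / count if count else math.inf


def estimate_offset(audio_emb, video_emb, candidates):
    """Return the candidate shift (in frames) with the smallest mean distance."""
    return min(candidates, key=lambda s: mean_distance(audio_emb, video_emb, s))


# Toy data (assumption): stand-in embeddings, not outputs of a trained network.
rng = random.Random(0)
dim, n_frames, true_shift = 4, 50, 3

video = [[rng.random() for _ in range(dim)] for _ in range(n_frames)]

# The audio stream starts with `true_shift` unrelated frames, followed by
# noisy copies of the video embeddings, so audio[i + 3] matches video[i].
padding = [[rng.random() for _ in range(dim)] for _ in range(true_shift)]
noisy_copy = [[x + rng.gauss(0, 0.01) for x in frame] for frame in video]
audio = padding + noisy_copy

estimated = estimate_offset(audio, video, range(-10, 11))
print("estimated shift in frames:", estimated)
```

In this toy setup the estimate is expressed in embedding frames; converting it to a time offset would depend on the frame rates of the two streams, which this summary does not give.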

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.00758/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/1907.00758/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/1907.00758/full.md

---
Source: https://tomesphere.com/paper/1907.00758