# Learning weakly supervised multimodal phoneme embeddings

**Authors:** Rahma Chaabouni, Ewan Dunbar, Neil Zeghidour, Emmanuel Dupoux

arXiv: 1704.06913 · 2017-10-19

## TL;DR

This paper investigates whether combining audio with visual lip-movement data, in a weakly supervised setting using Siamese networks and lexical same-different side information, yields richer phone representations and more discriminable phonological features.

## Contribution

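It introduces mono-task and multi-task methods for learning multimodal phoneme embeddings. Multi-task training improves discriminability for visual and multimodal inputs with minimal impact on audio-only inputs, and cross-modal visual input yields representations closer to abstract linguistic features than audio alone.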

## Key findings

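- Multi-task learning improves discriminability of visual and multimodal inputs while minimally impacting auditory inputs.
- Cross-modal visual input improves the discriminability of visually discernible phonological features (rounding, open/close, labial place of articulation).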
- Visual features related to lip movements can be integrated to improve phoneme representations.

## Abstract

Recent works have explored deep architectures for learning multimodal speech representations (e.g. audio and images, articulation and audio) in a supervised way. Here we investigate the role of combining different speech modalities, i.e. audio and visual information representing lip movements, in a weakly supervised way using Siamese networks and lexical same-different side information. In particular, we ask whether one modality can benefit from the other to provide a richer representation for phone recognition in a weakly supervised setting. We introduce mono-task and multi-task methods for merging speech and visual modalities for phone recognition. The mono-task learning consists of applying a Siamese network to the concatenation of the two modalities, while the multi-task learning receives several different combinations of modalities at train time. We show that multi-task learning enhances discriminability for visual and multimodal inputs while minimally impacting auditory inputs. Furthermore, we present a qualitative analysis of the obtained phone embeddings, and show that cross-modal visual input can improve the discriminability of phonological features which are visually discernible (rounding, open/close, labial place of articulation), resulting in representations that are closer to abstract linguistic features than those based on audio only.
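
The mono-task and multi-task setups described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the single `tanh` layer, the zero-filling of an absent modality, and the cosine-based same-different loss are all assumptions made for illustration, and the names `fuse`, `embed`, and `same_different_loss` are hypothetical. Mono-task training corresponds to always using `mode="both"`, while multi-task training would cycle through several modes.

```python
import math
import random


def fuse(audio, visual, mode="both"):
    """Concatenate audio and visual feature vectors.

    Assumption: when a modality is not presented ("audio" or "visual" mode),
    its slot is zero-filled so the network input size stays fixed.
    """
    if mode not in ("audio", "visual", "both"):
        raise ValueError("mode must be 'audio', 'visual', or 'both'")
    a = list(audio) if mode != "visual" else [0.0] * len(audio)
    v = list(visual) if mode != "audio" else [0.0] * len(visual)
    return a + v


def embed(x, weights):
    """One shared tanh layer; a Siamese network applies the same
    weights to both members of a pair."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards zero vectors


def same_different_loss(e1, e2, same):
    """Assumed loss: pull 'same word' pairs together, push 'different word'
    pairs toward orthogonality."""
    c = cosine(e1, e2)
    return (1.0 - c) / 2.0 if same else c * c


if __name__ == "__main__":
    rng = random.Random(0)
    # Two toy tokens, each with 4 audio and 3 visual features (made-up sizes).
    audio_a = [rng.gauss(0, 1) for _ in range(4)]
    visual_a = [rng.gauss(0, 1) for _ in range(3)]
    audio_b = [rng.gauss(0, 1) for _ in range(4)]
    visual_b = [rng.gauss(0, 1) for _ in range(3)]
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(7)] for _ in range(5)]

    # Multi-task style: evaluate the same pair under each input combination.
    for mode in ("audio", "visual", "both"):
        e1 = embed(fuse(audio_a, visual_a, mode), weights)
        e2 = embed(fuse(audio_b, visual_b, mode), weights)
        print(mode, same_different_loss(e1, e2, same=True))
```

Keeping the input dimension fixed across modes lets one set of weights serve audio-only, visual-only, and combined inputs, which is one simple way to realise "several different combinations of modalities at train time".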

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1704.06913/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1704.06913/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/1704.06913/full.md

---
Source: https://tomesphere.com/paper/1704.06913