# Towards Visually Grounded Sub-Word Speech Unit Discovery

**Authors:** David Harwath, James Glass

arXiv: 1902.08213 · 2019-02-25

## TL;DR

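This paper investigates how interpretable sub-word speech units emerge in a convolutional neural network trained to associate raw speech waveforms with semantically related natural images. Activation patterns in the model's intermediate layers mark diphone boundaries, which the model may be using for word recognition.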

## Contribution

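It shows that diphone boundaries can be extracted from the activation patterns of the model's intermediate layers, suggesting that the model may be leveraging these events for word recognition, and it presents experiments probing the information these events encode.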

## Key findings

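- Diphone boundaries can be identified from the activation patterns of the model's intermediate layers.
- A convolutional network trained to associate raw speech waveforms with images encodes sub-word speech information.
- The boundary events may be what the model leverages for word recognition.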
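
One simple way boundary-like events could be read off a layer's activations is peak picking on the per-frame activation magnitude. The sketch below is illustrative only: the function name `activation_peaks`, the L2-norm-over-channels score, and the `min_height` and `min_gap` thresholds are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def activation_peaks(activations, min_height=0.5, min_gap=2):
    """Return frame indices where the per-frame activation norm peaks.

    activations: array of shape (channels, frames); a hypothetical
    intermediate-layer output, not the paper's actual tensor layout.
    min_height and min_gap are illustrative thresholds (assumptions).
    """
    acts = np.asarray(activations, dtype=float)
    # L2 norm across channels gives one magnitude score per frame.
    norm = np.linalg.norm(acts, axis=0)
    if norm.size and norm.max() > 0:
        norm = norm / norm.max()  # scale scores to [0, 1]
    peaks = []
    for t in range(1, len(norm) - 1):
        is_local_max = norm[t] > norm[t - 1] and norm[t] >= norm[t + 1]
        if is_local_max and norm[t] >= min_height:
            # Keep a peak only if it is far enough from the previous one.
            if not peaks or t - peaks[-1] >= min_gap:
                peaks.append(t)
    return peaks
```

Candidate boundaries found this way would still need to be compared against a reference phonetic alignment before drawing any conclusion about diphone structure.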

## Abstract

In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition. We present a series of experiments investigating the information encoded by these events.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.08213/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/1902.08213/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/1902.08213/full.md

---
Source: https://tomesphere.com/paper/1902.08213