# Learning Word-Like Units from Joint Audio-Visual Analysis

**Authors:** David Harwath, James R. Glass

arXiv: 1701.07481 · 2017-05-26

## TL;DR

This paper introduces a method for discovering word-like units in spoken image captions and grounding them to semantically relevant image regions, without relying on transcriptions or conventional automatic speech recognition.

## Contribution

The paper presents a joint audio-visual learning approach that identifies word-like acoustic units in continuous speech and links them to relevant image regions, using neither conventional automatic speech recognition nor text transcriptions.

## Key findings

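- Detects spoken instances of words in continuous speech (for example, 'lighthouse' within an utterance)
- Grounds the detected spoken words to the image regions that contain the corresponding objects
- Learns semantic associations between speech units and visual concepts without text or linguistic annotations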

## Abstract

Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions. For example, our model is able to detect spoken instances of the word 'lighthouse' within an utterance and associate them with image regions containing lighthouses. We do not use any form of conventional automatic speech recognition, nor do we use any text transcriptions or conventional linguistic annotations. Our model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.
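The grounding step described in the abstract can be illustrated with a minimal sketch. It assumes that speech segments and image regions have already been mapped into a shared embedding space and that they are compared by cosine similarity; the summary does not specify the scoring function or the models, so the function name `ground_segments`, the array shapes, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import numpy as np


def ground_segments(audio_emb, region_emb):
    """Match each speech segment to its most similar image region.

    audio_emb:  (S, D) array, one embedding per candidate speech segment.
    region_emb: (R, D) array, one embedding per candidate image region.
    Returns (best_region, best_score), each of length S.

    Assumption: cosine similarity in a shared space stands in for
    whatever scoring the paper actually uses.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    sims = a @ r.T  # (S, R) similarity matrix
    return sims.argmax(axis=1), sims.max(axis=1)


# Toy data: four regions as one-hot vectors in an 8-dimensional space.
region_emb = np.eye(4, 8)
# Two speech segments: one mostly aligned with region 2, one with region 0.
audio_emb = np.array([
    0.9 * region_emb[2] + 0.1 * region_emb[1],
    region_emb[0],
])

best_region, best_score = ground_segments(audio_emb, region_emb)
print(best_region.tolist())
```

In a sketch like this, segment-region pairs with high scores would be the candidates for word-like units; how such pairs are grouped into word categories is not covered by this summary.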

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1701.07481/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1701.07481/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/1701.07481/full.md

---
Source: https://tomesphere.com/paper/1701.07481