# Models of Visually Grounded Speech Signal Pay Attention To Nouns: a Bilingual Experiment on English and Japanese

**Authors:** William N. Havard, Jean-Pierre Chevrot, Laurent Besacier

arXiv: 1902.03052 · 2019-02-11

## TL;DR

Neural models of visually grounded speech, trained on English and on Japanese, concentrate their attention on nouns in both languages and on word endings, a pattern that has been theorised for human attention. The paper also investigates how two monolingual models can be combined for cross-lingual speech-to-speech retrieval.

## Contribution

The paper shows that attention in neural models of visually grounded speech consistently targets nouns in two typologically very different languages, English and Japanese, and draws parallels between this behaviour and human attention.

## Key findings

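- Attention focuses on nouns in both English and Japanese, despite the typological differences between the two languages.
- Neural attention focuses on word endings, as has been theorised for human attention.
- Two visually grounded monolingual models are investigated for cross-lingual speech-to-speech retrieval.
- The bilingual speech-image corpora, enriched with part-of-speech tags and forced alignments, are distributed for reproducible research.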
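To make the noun-focus finding concrete, the following is a minimal sketch of how attention weights could be related to part-of-speech tags through a forced alignment. It is not the authors' analysis code: the function name `attention_by_pos`, the per-frame weight list, the `frame_shift` value, and the toy alignment are all assumptions made for illustration.

```python
from collections import defaultdict


def attention_by_pos(weights, frame_shift, words):
    """Sum attention weights per part-of-speech tag.

    weights:     per-frame attention weights (hypothetical values)
    frame_shift: duration of one attention frame in seconds (assumed)
    words:       (start_sec, end_sec, pos_tag) tuples from a forced alignment
    """
    mass = defaultdict(float)
    for i, weight in enumerate(weights):
        centre = (i + 0.5) * frame_shift  # centre time of frame i
        for start, end, pos in words:
            if start <= centre < end:
                mass[pos] += weight
                break  # each frame belongs to at most one word
    return dict(mass)


# Toy utterance: a determiner followed by a noun (illustrative data only).
weights = [0.05, 0.05, 0.1, 0.4, 0.3, 0.1]
words = [(0.0, 0.2, "DET"), (0.2, 0.6, "NOUN")]
result = attention_by_pos(weights, 0.1, words)
```

Frames that fall outside every aligned word (silence, for instance) are simply ignored here; a real analysis would have to decide how to treat them, and how to normalise for the differing durations of word classes.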

## Abstract

We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns and this behaviour holds true for two very typologically different languages. We also draw parallels between artificial neural attention and human attention and show that neural attention focuses on word endings as it has been theorised for human attention. Finally, we investigate how two visually grounded monolingual models can be used to perform cross-lingual speech-to-speech retrieval. For both languages, the enriched bilingual (speech-image) corpora with part-of-speech tags and forced alignments are distributed to the community for reproducible research.
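The cross-lingual speech-to-speech retrieval mentioned in the abstract can be illustrated with a small ranking sketch. It assumes, hypothetically, that an English utterance and a set of Japanese utterances have already been embedded as vectors that are comparable with one another (for example because both models were trained against the same images). The abstract does not specify the procedure, so the helper names and the toy vectors below are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, candidates):
    """Return candidate indices ranked by decreasing similarity to the query."""
    return sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )


# Hypothetical embeddings: one English query, three Japanese candidates.
english_query = [0.9, 0.1, 0.0]
japanese_utterances = [[0.0, 1.0, 0.0], [1.0, 0.2, 0.0], [0.0, 0.0, 1.0]]
ranking = retrieve(english_query, japanese_utterances)
```

In this toy example the second Japanese vector points in nearly the same direction as the query and is ranked first. Retrieval quality in practice would depend on how well the two monolingual embedding spaces line up.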

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.03052/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/1902.03052/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/1902.03052/full.md

---
Source: https://tomesphere.com/paper/1902.03052