# Symbolic inductive bias for visually grounded learning of spoken language

**Authors:** Grzegorz Chrupała

arXiv: 1812.09244 · 2023-06-02

## TL;DR

This paper introduces a multitask learning approach that uses existing transcribed speech to improve end-to-end, visually grounded learning of spoken language. Adding a speech/text matching task yields substantial gains on image retrieval over training the speech/image task alone.

## Contribution

The paper proposes a three-task architecture that combines speech/image, speech/text, and text/image matching objectives, and conjectures that transcribed speech supplies a strong symbolic inductive bias to the model.

## Key findings

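- Adding the speech/text task substantially improves image retrieval compared with training the speech/image task in isolation.
- The paper conjectures that the gain comes from a strong inductive bias provided by transcribed speech, and offers supporting evidence.
- Multitask learning lets end-to-end spoken language models exploit existing transcribed speech.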

## Abstract

A widespread approach to processing spoken language is to first automatically transcribe it into text. An alternative is to use an end-to-end approach: recent works have proposed to learn semantic embeddings of spoken language from images with spoken captions, without an intermediate transcription step. We propose to use multitask learning to exploit existing transcribed speech within the end-to-end setting. We describe a three-task architecture which combines the objectives of matching spoken captions with corresponding images, speech with text, and text with images. We show that the addition of the speech/text task leads to substantial performance improvements on image retrieval when compared to training the speech/image task in isolation. We conjecture that this is due to a strong inductive bias transcribed speech provides to the model, and offer supporting evidence for this.
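The three matching objectives named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the bidirectional margin-based ranking loss, the margin value, the weighted sum used to combine the tasks, and the function names `margin_ranking_loss` and `three_task_loss`. This summary does not say which loss the paper uses or how it schedules the tasks during training.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def margin_ranking_loss(a, b, margin=0.2):
    """Hypothetical matching loss: row i of `a` is paired with row i of `b`.

    Penalizes every mismatched pair whose cosine similarity comes within
    `margin` of the matched pair's similarity, in both directions.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    sim = a @ b.T                       # pairwise cosine similarities
    pos = np.diag(sim)                  # similarities of the matched pairs
    cost_ab = np.maximum(0.0, margin + sim - pos[:, None])  # anchor a_i, wrong b_j
    cost_ba = np.maximum(0.0, margin + sim - pos[None, :])  # anchor b_j, wrong a_i
    off_diag = 1.0 - np.eye(len(a))     # exclude the matched pairs themselves
    return float(((cost_ab + cost_ba) * off_diag).sum() / len(a))

def three_task_loss(speech, image, text, weights=(1.0, 1.0, 1.0)):
    """Assumed combination: weighted sum of the three matching objectives."""
    w_si, w_st, w_ti = weights
    return (w_si * margin_ranking_loss(speech, image)    # speech/image
            + w_st * margin_ranking_loss(speech, text)   # speech/text
            + w_ti * margin_ranking_loss(text, image))   # text/image
```

In this sketch `speech`, `image`, and `text` stand for batches of embeddings produced by three hypothetical encoders, one row per caption. Training the speech/image task in isolation corresponds to `weights=(1.0, 0.0, 0.0)`.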

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.09244/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/1812.09244/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/1812.09244/full.md

---
Source: https://tomesphere.com/paper/1812.09244