# On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval

**Authors:** Ankita Pasad, Bowen Shi, Herman Kamper, Karen Livescu

arXiv: 1904.10947 · 2019-09-04

## TL;DR

This paper investigates how combining visual and textual supervision improves semantic speech retrieval in low-resource settings, showing that visual grounding enhances performance even with some textual data available.

## Contribution

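The paper proposes a multitask learning approach that leverages both visual and textual supervision, with the visual supervision supplied as keyword probabilities from an external tagger. It shows that visual grounding still helps low-resource semantic speech retrieval when some transcribed speech is available.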
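To make the multitask idea concrete, here is a minimal sketch of how a visual loss and a textual loss could be combined into a single training objective. It is an illustration under assumptions, not the authors' implementation: the function names `soft_bce` and `multitask_loss`, the use of binary cross-entropy for both branches, the bag-of-words form of the textual targets, and the interpolation weight `alpha` are all hypothetical choices that the summary does not specify.

```python
import math


def soft_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted keyword probabilities
    and targets in [0, 1]. Targets may be soft (tagger probabilities)
    or hard (0/1 word-presence labels)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def multitask_loss(pred_visual, tagger_probs, pred_text, bow_labels, alpha=0.5):
    """Weighted sum of a visual loss and a textual loss.

    pred_visual  : speech-model outputs for the visually supervised branch
    tagger_probs : keyword probabilities from an external tagger (soft targets)
    pred_text    : speech-model outputs for the text-supervised branch
    bow_labels   : 0/1 labels derived from transcripts (assumed target form)
    alpha        : hypothetical weight on the visual loss
    """
    loss_visual = soft_bce(pred_visual, tagger_probs)
    loss_text = soft_bce(pred_text, bow_labels)
    return alpha * loss_visual + (1.0 - alpha) * loss_text


# Toy example over a three-word vocabulary.
loss = multitask_loss(
    pred_visual=[0.7, 0.2, 0.1],
    tagger_probs=[0.9, 0.3, 0.0],
    pred_text=[0.8, 0.1, 0.4],
    bow_labels=[1, 0, 0],
    alpha=0.5,
)
print(f"combined loss: {loss:.4f}")
```

Setting `alpha=1.0` reduces this sketch to visual supervision only and `alpha=0.0` to textual supervision only, which is one simple way to compare the single-modality and combined settings.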

## Key findings

- Visual grounding improves semantic speech retrieval performance.
- With ~5 hours of transcribed speech, adding visual supervision yields 23% higher average precision than training without it.
- Visual and textual supervision together outperform either modality alone.
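
The findings above are reported in terms of average precision. As a reminder of what that metric measures, the following sketch computes average precision for one keyword query over a ranked list of utterances. It uses the standard textbook definition with binary relevance labels; the function name `average_precision` and the toy scores are illustrative assumptions, and the paper's exact evaluation protocol (based on human relevance judgments) may differ.

```python
def average_precision(scores, relevant):
    """Average precision for a single query.

    scores   : model relevance scores, one per utterance
    relevant : 0/1 labels marking which utterances are truly relevant
    """
    # Rank utterances from highest to lowest score.
    ranked = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_relevant) in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item
    return precision_sum / hits if hits else 0.0


# Toy query: relevant utterances land at ranks 1 and 3.
ap = average_precision(scores=[0.9, 0.8, 0.3, 0.1], relevant=[1, 0, 1, 0])
print(f"average precision: {ap:.4f}")
```

In the toy query the precisions at the two relevant ranks are 1/1 and 2/3, so the result is their mean, 5/6.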

## Abstract

Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision. In real-world low-resource settings, however, we often have access to some transcribed speech. We study whether and how visual grounding is useful in the presence of varying amounts of textual supervision. In particular, we consider the task of semantic speech retrieval in a low-resource setting. We use a previously studied data set and task, where models are trained on images with spoken captions and evaluated on human judgments of semantic relevance. We propose a multitask learning approach to leverage both visual and textual modalities, with visual supervision in the form of keyword probabilities from an external tagger. We find that visual grounding is helpful even in the presence of textual supervision, and we analyze this effect over a range of sizes of transcribed data sets. With ~5 hours of transcribed speech, we obtain 23% higher average precision when also using visual supervision.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.10947/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1904.10947/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/1904.10947/full.md

---
Source: https://tomesphere.com/paper/1904.10947