# Towards multi-task learning of speech and speaker recognition

**Authors:** Nik Vaessen, David A. van Leeuwen

arXiv: 2302.12773 · 2023-05-29

## TL;DR

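This paper studies multi-task learning of speech and speaker recognition using a wav2vec2 base network with two task-specific output heads. The resulting models produce a shared speech and speaker embedding that at first glance performs comparably to separate single-task models, but they degrade strongly on out-of-distribution evaluation data.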

## Contribution

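The paper trains wav2vec2 with two task-specific output heads, compares architectural choices for mixing speaker and speech information in the output sequence alongside different optimization strategies, and shows that the resulting multi-task networks perform markedly worse than single-task models on out-of-distribution evaluation data.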

## Key findings

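- A shared speech and speaker embedding achieves, at first glance, performance comparable to separate single-task models
- Multi-task networks show strongly degraded performance on out-of-distribution evaluation data compared to single-task models
- The study compares architectural choices for mixing speaker and speech information in the output sequence, as well as different optimization strategies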

## Abstract

We study multi-task learning for two orthogonal speech technology tasks: speech and speaker recognition. We use wav2vec2 as a base architecture with two task-specific output heads. We experiment with different architectural decisions to mix speaker and speech information in the output sequence as well as different optimization strategies. Our multi-task learning networks can produce a shared speaker and speech embedding, which at first glance achieves a performance comparable to separate single-task models. However, we show that the multi-task networks have strongly degraded performance on out-of-distribution evaluation data compared to the single-task models. Code and model checkpoints are available at https://github.com/nikvaessen/disjoint-mtl
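As a rough illustration of the "shared base network, two task-specific output heads" idea in the abstract, the sketch below runs a toy forward pass in plain Python. It is not the authors' implementation: the random `frames` merely stand in for wav2vec2 output features, and the mean pooling used for the speaker branch, the helper names (`linear`, `init_linear`, `mean_pool`, `two_head_forward`), and all dimensions are assumptions made for illustration only.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(n_in, n_out, rng):
    """Create small random weights and zero biases for a linear layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def mean_pool(frames):
    """Average frame-level vectors over time into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def two_head_forward(frames, speech_head, speaker_head):
    """Feed one shared frame sequence to two task-specific heads.

    Speech head: one set of token logits per frame.
    Speaker head: one set of speaker logits per utterance, computed from
    a mean-pooled embedding (the pooling choice is an assumption here).
    """
    speech_logits = [linear(frame, *speech_head) for frame in frames]
    speaker_embedding = mean_pool(frames)
    speaker_logits = linear(speaker_embedding, *speaker_head)
    return speech_logits, speaker_embedding, speaker_logits


rng = random.Random(0)
HIDDEN, VOCAB, SPEAKERS, FRAMES = 8, 5, 3, 4  # hypothetical toy sizes

# Stand-in for the shared encoder output: FRAMES vectors of size HIDDEN.
frames = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(FRAMES)]
speech_head = init_linear(HIDDEN, VOCAB, rng)
speaker_head = init_linear(HIDDEN, SPEAKERS, rng)

speech_logits, speaker_embedding, speaker_logits = two_head_forward(
    frames, speech_head, speaker_head
)
print(len(speech_logits), len(speech_logits[0]), len(speaker_embedding), len(speaker_logits))
# prints: 4 5 8 3
```

The point of the sketch is only the data flow: both heads read the same sequence, so any training signal from either task would reach the shared features. How the paper actually mixes speaker and speech information, and how it balances the two objectives, is described in the full text and the linked repository.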

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.12773/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/2302.12773/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/2302.12773/full.md

---
Source: https://tomesphere.com/paper/2302.12773