# Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation

**Authors:** Elizabeth Salesky, Matthias Sperber, Alan W Black

arXiv: 1906.01199 · 2019-06-05

## TL;DR

This paper shows that compressed phoneme-like speech representations improve end-to-end speech translation by up to 5 BLEU and cut training time by 60% compared with traditional frame-level features.

## Contribution

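The authors introduce a simple phoneme-based compression method: each speech frame is given a phoneme label, and consecutive frames with the same label are averaged. The resulting shorter source sequences improve translation quality and reduce training time across two language pairs and multiple data sizes.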

## Key findings

- Improvements of up to 5 BLEU over frame-level features.
- 60% reduction in training time.
- Gains hold on both the high-resource and low-resource language pairs and across multiple data sizes.

## Abstract

Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.
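The compression step described in the abstract (averaging consecutive frames that share a phoneme label) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `compress_by_phoneme`, the toy feature matrix, and the label strings are hypothetical, and the per-frame phoneme labels are assumed to be already available from some external labeling step.

```python
import numpy as np


def compress_by_phoneme(features, labels):
    """Average each run of consecutive frames that share a phoneme label.

    features: (num_frames, feature_dim) array of frame-level features.
    labels:   (num_frames,) array of per-frame phoneme labels (assumed given).
    Returns (compressed_features, run_labels), one row per run.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    if features.shape[0] != labels.shape[0]:
        raise ValueError("features and labels must have the same number of frames")
    if labels.size == 0:
        return features[:0], labels[:0]

    # A run starts at frame 0 and wherever the label differs from the previous frame.
    is_start = np.concatenate(([True], labels[1:] != labels[:-1]))
    starts = np.flatnonzero(is_start)

    # Sum the frames in each run, then divide by the run length to get the mean.
    sums = np.add.reduceat(features, starts, axis=0)
    lengths = np.diff(np.concatenate((starts, [labels.size])))
    return sums / lengths[:, None], labels[starts]


# Toy example: 7 frames with 2-dimensional features (illustrative values only).
frames = np.arange(14, dtype=float).reshape(7, 2)
frame_labels = np.array(["sil", "sil", "k", "ae", "ae", "ae", "t"])

compressed, run_labels = compress_by_phoneme(frames, frame_labels)
print(compressed.shape)   # 7 frames collapse to 4 phoneme-level vectors
print(list(run_labels))
```

Only adjacent frames are merged, so a phoneme that recurs later in the utterance produces a separate vector and the temporal order of the sequence is preserved.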

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.01199/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/1906.01199/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/1906.01199/full.md

---
Source: https://tomesphere.com/paper/1906.01199