# End-to-end Adaptation with Backpropagation through WFST for On-device Speech Recognition System

**Authors:** Emiru Tsunoo, Yosuke Kashiwagi, Satoshi Asakawa, Toshiyuki Kumakura

arXiv: 1905.07149 · 2019-06-25

## TL;DR

This paper introduces a novel end-to-end adaptation method for on-device speech recognition that jointly adapts acoustic models and WFSTs through backpropagation, improving performance across different environments and languages.

## Contribution

The paper converts a pretrained weighted finite-state transducer (WFST) into a trainable neural network and adapts it jointly with the acoustic model (AM) through end-to-end (E2E) training, so that both components are adjusted to the target environment and vocabulary.

## Key findings

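- Joint E2E training of the AM and the WFST achieves the best recognition performance, ahead of adapting either component alone.
- The method also works for language adaptation: each language system (English, Japanese) was adapted to the other language using the adaptation data.
- Fine-tuning only the WFST and fine-tuning only the AM are each comparable to a state-of-the-art adaptation method.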

## Abstract

An on-device DNN-HMM speech recognition system efficiently works with a limited vocabulary in the presence of a variety of predictable noise. In such a case, vocabulary and environment adaptation is highly effective. In this paper, we propose a novel method of end-to-end (E2E) adaptation, which adjusts not only an acoustic model (AM) but also a weighted finite-state transducer (WFST). We convert a pretrained WFST to a trainable neural network and adapt the system to target environments/vocabulary by E2E joint training with an AM. We replicate Viterbi decoding with forward–backward neural network computation, which is similar to recurrent neural networks (RNNs). By pooling output score sequences, a vocabulary posterior for each utterance is obtained and used for discriminative loss computation. Experiments using 2–10 hours of English/Japanese adaptation datasets indicate that the fine-tuning of only WFSTs and that of only AMs are both comparable to a state-of-the-art adaptation method, and E2E joint training of the two components achieves the best recognition performance. We also adapt each language system to the other language using the adaptation data, and the results show that the proposed method also works well for language adaptations.
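
To make the idea in the abstract concrete, here is a minimal sketch of a Viterbi-style recursion written as a recurrent computation, followed by pooling over time and a softmax to obtain a per-utterance vocabulary posterior. It is not the authors' implementation: the function names, the toy two-word state layout, the choice of max pooling over each word's final state, and the cross-entropy loss are all assumptions made for illustration, since this summary does not give the paper's exact formulation.

```python
import math

NEG_INF = float("-inf")


def viterbi_forward(emissions, trans, init):
    """Max-plus recursion over scores, one step per frame (RNN-like).

    emissions: T frames, each a list of S per-state scores (assumed AM output).
    trans:     S x S transition scores; trans[i][j] is the weight of arc i -> j.
               In an adaptation setting these would be the trainable parameters.
    init:      S initial scores (NEG_INF for states that cannot start a path).
    Returns the list of T score vectors.
    """
    n = len(init)
    scores = [[init[s] + emissions[0][s] for s in range(n)]]
    for frame in emissions[1:]:
        prev = scores[-1]
        scores.append(
            [max(prev[i] + trans[i][j] for i in range(n)) + frame[j] for j in range(n)]
        )
    return scores


def vocabulary_posterior(scores, word_final_states):
    """Max-pool each word's final-state score over time, then apply softmax.

    Max pooling is an assumption of this sketch, not a detail from the paper.
    """
    pooled = [max(step[s] for step in scores) for s in word_final_states]
    top = max(pooled)
    exps = [math.exp(p - top) for p in pooled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(posterior, target):
    """Discriminative loss for the reference word index (assumed loss choice)."""
    return -math.log(posterior[target])


# Toy example: word A uses states 0 -> 1, word B uses states 2 -> 3.
init = [0.0, NEG_INF, 0.0, NEG_INF]
trans = [[NEG_INF] * 4 for _ in range(4)]
for s in range(4):
    trans[s][s] = 0.0          # self-loops
trans[0][1] = 0.0              # word A: state 0 -> state 1
trans[2][3] = 0.0              # word B: state 2 -> state 3

# Hypothetical AM scores that favour the state sequence of word A.
emissions = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
]

scores = viterbi_forward(emissions, trans, init)
posterior = vocabulary_posterior(scores, word_final_states=[1, 3])
loss = cross_entropy(posterior, target=0)
```

Every operation here (addition, max, softmax, log) is differentiable almost everywhere, so in an automatic-differentiation framework the loss could be backpropagated to both the transition scores and the acoustic model that produces the emissions. That is the sense in which a WFST can be treated as a trainable network in this sketch.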

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.07149/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1905.07149/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/1905.07149/full.md

---
Source: https://tomesphere.com/paper/1905.07149