# Using of heterogeneous corpora for training of an ASR system

**Authors:** Jan Trmal, Gaurav Kumar, Vimal Manohar, Sanjeev Khudanpur, Matt Post, Paul McNamee

arXiv: 1706.00321 · 2017-06-02

## TL;DR

This paper describes the Pashto LVCSR system built at the SCALE 2015 workshop and the techniques that let it benefit from multiple heterogeneous speech corpora, in a low-resource setting where simply concatenating the corpora gave little to no gain.

## Contribution

It introduces techniques for leveraging diverse speech datasets to improve ASR performance in low-resource language scenarios.

## Key findings

- Simply merging (concatenating) the corpora yields little to no benefit in preliminary experiments
- Advanced data integration methods improve recognition accuracy
- Heterogeneous data can be effectively used with proper techniques

## Abstract

The paper summarizes the development of the LVCSR system built as part of the Pashto speech-translation system at the SCALE (Summer Camp for Applied Language Exploration) 2015 workshop on "Speech-to-text-translation for low-resource languages". The Pashto language was chosen as a good "proxy" low-resource language, exhibiting multiple phenomena that make the development of speech-recognition and speech-to-text-translation systems hard. Even when the amount of data is seemingly sufficient, the data originates from multiple sources, and the preliminary experiments reveal that there is little to no benefit in merging (concatenating) the corpora, so more elaborate ways of making use of all of the data must be worked out. This paper concentrates only on the LVCSR part and presents a range of techniques that were found to be useful for benefiting from multiple different corpora.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1706.00321/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1706.00321/full.md

## References

12 references — full list in the complete paper: https://tomesphere.com/paper/1706.00321/full.md

---
Source: https://tomesphere.com/paper/1706.00321