AlloST: Low-resource Speech Translation without Source Transcription
Yao-Fei Cheng, Hung-Shin Lee, and Hsin-Min Wang

TL;DR
This paper introduces a low-resource speech translation framework that uses a language-independent universal phone recognizer and phonetic embeddings, outperforming a conformer-based baseline without relying on source transcriptions.
Contribution
It proposes an attention-based sequence-to-sequence model that fuses phonetic embeddings with phone-aware acoustic representations and applies BPE segmentation to the recognized phones, enabling low-resource speech translation without source transcription.
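As an illustration of BPE segmentation applied to phones rather than characters, here is a minimal sketch. It assumes phone sequences are plain lists of phone symbols; the function names (`learn_phone_bpe`, `apply_merge`, `segment`), the `+` joiner, and the toy corpus are hypothetical and not taken from the paper.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged unit."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + "+" + seq[i + 1])  # "+" joiner is an arbitrary choice
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_phone_bpe(sequences, num_merges):
    """Learn up to `num_merges` BPE merges from a corpus of phone sequences."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))  # count adjacent phone pairs
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges


def segment(seq, merges):
    """Apply learned merges, in order, to a new phone sequence."""
    seq = list(seq)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Toy corpus of phone sequences (hypothetical data).
corpus = [["o", "l", "a"], ["o", "l", "a"], ["l", "a"]]
merges = learn_phone_bpe(corpus, num_merges=2)
print(merges)
print(segment(["o", "l", "a"], merges))
```

Each merge shortens the sequence, so frequent phone pairs collapse into larger units. How many merges to learn is a tuning choice that this sketch leaves open.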
Findings
Outperforms conformer-based baseline models
Achieves performance close to methods using source transcription
Effective on the Fisher Spanish-English and Taigi-Mandarin drama corpora
Abstract
The end-to-end architecture has made promising progress in speech translation (ST). However, the ST task is still challenging under low-resource conditions. Most ST models have shown unsatisfactory results, especially in the absence of word information from the source speech utterance. In this study, we survey methods to improve ST performance without using source transcription, and propose a learning framework that utilizes a language-independent universal phone recognizer. The framework is based on an attention-based sequence-to-sequence model, where the encoder generates the phonetic embeddings and phone-aware acoustic representations, and the decoder controls the fusion of the two embedding streams to produce the target token sequence. In addition to investigating different fusion strategies, we explore the specific usage of byte pair encoding (BPE), which compresses a phone…
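The decoder-side fusion of the two encoder streams mentioned in the abstract can be pictured with a small sketch. The visible text does not say which fusion strategies the paper compares, so the code below shows one generic option only: the decoder state attends to each stream separately, and a scalar gate mixes the two context vectors. The names `attend`, `gated_fusion`, and `gate_logit` are hypothetical, and in a trained model the gate would be learned rather than passed in by hand.

```python
import math


def attend(query, memory):
    """Scaled dot-product attention of one query vector over the rows of `memory`."""
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, row)) / scale for row in memory]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * row[j] for w, row in zip(weights, memory))
            for j in range(len(query))]


def gated_fusion(query, acoustic, phonetic, gate_logit=0.0):
    """Mix the acoustic and phonetic context vectors with a sigmoid gate.

    The two streams may have different lengths (frames vs. phone units);
    attention reduces each to one vector of the same dimension first.
    """
    ctx_a = attend(query, acoustic)
    ctx_p = attend(query, phonetic)
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # assumption: a single scalar gate
    return [gate * a + (1.0 - gate) * p for a, p in zip(ctx_a, ctx_p)]


# Toy inputs (hypothetical): 3 acoustic frames, 1 phonetic unit, dimension 2.
query = [1.0, 0.0]
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phonetic = [[2.0, 2.0]]
fused = gated_fusion(query, acoustic, phonetic, gate_logit=0.0)
print(fused)
```

With `gate_logit=0.0` the gate is 0.5 and the result is the plain average of the two context vectors. Large positive values favor the acoustic stream and large negative values favor the phonetic one.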
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
