T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5
Chan-Jan Hsu, Ho-Lam Chung, Hung-yi Lee, Yu Tsao

TL;DR
This paper introduces T5lephone, a phoneme-based T5 model that improves spoken language understanding by better aligning speech and text representations, achieving state-of-the-art results on SQA and speech translation tasks.
Contribution
The paper explores the impact of tokenization strategies on PLMs for SLU and proposes T5lephone, a phoneme-based T5 variant pretrained with phonemicized text for enhanced speech-text understanding.
Findings
T5lephone outperforms T5 with other units on SQA and speech translation.
Phoneme-level tokenization improves alignment between speech and text models.
State-of-the-art results are achieved on the NMSQA dataset.
Abstract
In spoken language understanding (SLU), a natural solution is concatenating pretrained speech models (e.g. HuBERT) and pretrained language models (PLMs, e.g. T5). Most previous works use pretrained language models with subword-based tokenization. However, the granularity of input units affects the alignment of speech model outputs and language model inputs, and PLMs with character-based tokenization are underexplored. In this work, we conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding tasks, including spoken question answering (SQA) and speech translation (ST). We further extend the idea to create T5lephone (pronounced as telephone), a variant of T5 that is pretrained using phonemicized text. We initialize T5lephone with existing PLMs to pretrain it using relatively lightweight computational resources. We reached state-of-the-art…
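The granularity point in the abstract can be illustrated with a minimal sketch. It is not the paper's pipeline: `TOY_LEXICON`, `char_tokens`, and `phoneme_tokens` are hypothetical names introduced here, the three-word lexicon stands in for a real grapheme-to-phoneme tool, and whitespace-split words serve only as a coarse stand-in for subword units.

```python
# Toy comparison of input-unit granularity for the same sentence.
# ASSUMPTION: TOY_LEXICON is a hand-written, hypothetical lexicon; a real
# system would use a grapheme-to-phoneme tool to phonemicize text.
TOY_LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}


def char_tokens(text):
    """Character-level units: one token per character, spaces included."""
    return list(text.lower())


def phoneme_tokens(text, lexicon=TOY_LEXICON, word_sep="|"):
    """Phoneme-level units: look up each word, with a separator between words."""
    words = text.lower().split()
    tokens = []
    for i, word in enumerate(words):
        if word not in lexicon:
            raise KeyError(f"word not in toy lexicon: {word!r}")
        tokens.extend(lexicon[word])
        if i < len(words) - 1:
            tokens.append(word_sep)
    return tokens


sentence = "the cat sat"
word_units = sentence.split()          # coarse stand-in for subword units
char_units = char_tokens(sentence)     # character-level units
phone_units = phoneme_tokens(sentence) # phoneme-level units

# Finer units give longer sequences, closer in rate to frame-level speech features.
print(len(word_units), len(char_units), len(phone_units))
```

In this toy case the word-level sequence is much shorter than the character-level and phoneme-level sequences, which is the kind of length mismatch with speech model outputs that the abstract refers to.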
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Gated Linear Unit · Inverse Square Root Schedule · Adafactor · Softmax
