T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5

Chan-Jan Hsu; Ho-Lam Chung; Hung-yi Lee; Yu Tsao

arXiv:2211.00586·cs.CL·November 2, 2022

TL;DR

This paper introduces T5lephone, a phoneme-based T5 model that improves spoken language understanding by better aligning speech and text representations, achieving state-of-the-art results on spoken question answering (SQA) and speech translation tasks.

Contribution

The paper studies how tokenization strategies in pretrained language models (PLMs) affect spoken language understanding (SLU), and proposes T5lephone, a phoneme-based T5 variant pretrained on phonemicized text to improve speech-text understanding.

Findings

01. T5lephone outperforms T5 variants that use other input units on SQA and speech translation.

02. Phoneme-level tokenization improves alignment between speech and text models.

03. State-of-the-art results are achieved on the NMSQA dataset.

Abstract

In spoken language understanding (SLU), a natural solution is to concatenate pretrained speech models (e.g., HuBERT) and pretrained language models (PLMs, e.g., T5). Most previous work uses PLMs with subword-based tokenization. However, the granularity of the input units affects the alignment between speech model outputs and language model inputs, and PLMs with character-based tokenization remain underexplored. In this work, we conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding tasks, including spoken question answering (SQA) and speech translation (ST). We further extend the idea to create T5lephone (pronounced as telephone), a variant of T5 that is pretrained using phonemicized text. We initialize T5lephone with existing PLMs to pretrain it using relatively lightweight computational resources. We reached state-of-the-art…
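To make the idea of "phonemicized text" and input granularity concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a tiny hypothetical lexicon (`TOY_LEXICON`, with ARPAbet-style symbols) in place of a real grapheme-to-phoneme tool, and the helper names `phonemicize` and `char_tokens` are illustrative only. It shows how the same sentence yields sequences of different lengths depending on the unit chosen.

```python
# Hypothetical toy lexicon (assumption): a real system would use a full
# grapheme-to-phoneme tool or pronunciation dictionary instead.
TOY_LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}


def phonemicize(text, lexicon=TOY_LEXICON, unk="<unk>"):
    """Map each whitespace-separated word to its phonemes.

    Words missing from the lexicon are replaced by a single `unk` symbol.
    """
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(lexicon.get(word, [unk]))
    return phonemes


def char_tokens(text):
    """Character-level units, ignoring whitespace."""
    return [c for c in text.lower() if not c.isspace()]


if __name__ == "__main__":
    sentence = "The cat sat"
    # Word-level split used here only as a stand-in for coarser units.
    print("words:     ", sentence.lower().split())
    print("characters:", char_tokens(sentence))
    print("phonemes:  ", phonemicize(sentence))
```

Finer units such as characters or phonemes produce longer sequences than word-like units, which is the granularity difference the abstract refers to when discussing alignment with speech model outputs.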

Code & Models

Repositories

splend1d/t5lephone
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Gated Linear Unit · Inverse Square Root Schedule · Adafactor · Softmax