Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?

Ioannis Tsiamas; Matthias Sperber; Andrew Finch; Sarthak Garg

arXiv:2410.24019 · cs.CL · November 1, 2024

Open Access

TL;DR

This paper investigates whether speech-to-text translation (S2TT) systems make effective use of prosodic features such as intonation and rhythm, and introduces a benchmark and methodology for evaluating prosody awareness in translation models.

Contribution

The authors propose a novel evaluation methodology and benchmark (ContraProST) for assessing prosody awareness in speech-to-text translation systems, and analyze how different models leverage prosodic information.

Findings

1. S2TT models have some internal representation of prosody.

2. End-to-end systems outperform cascaded systems in prosody utilization.

3. Cascaded systems can also capture prosody, but less effectively, and how much they capture depends on the transcript's surface form.

Abstract

The prosody of a spoken utterance, including features like stress, intonation and rhythm, can significantly affect the underlying semantics, and as a consequence can also affect its textual translation. Nevertheless, prosody is rarely studied within the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well-suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but the understanding of whether this is successful in practice is still limited. A main challenge is the difficulty of evaluating prosody awareness in translation. To address this challenge, we introduce an evaluation methodology and a focused benchmark (named ContraProST) aimed at capturing a wide range of prosodic phenomena. Our methodology uses large language models and controllable…
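The abstract is cut off before it describes how systems are scored, so the following is only a generic sketch of what a contrastive prosody check could look like, not the authors' metric. It assumes each benchmark item pairs two spoken renditions of the same words with two different translations, and that a system exposes some `score(audio, translation)` function. The `ContrastivePair` schema, the `Scorer` interface, and the toy scorers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ContrastivePair:
    """Two renditions of the same words with different prosody (hypothetical schema)."""
    audio_a: str        # identifier of the first spoken rendition
    audio_b: str        # identifier of the second spoken rendition
    translation_a: str  # translation matching the meaning of rendition A
    translation_b: str  # translation matching the meaning of rendition B


# Assumed interface: score(audio, translation) -> float, higher means more likely.
Scorer = Callable[[str, str], float]


def pair_is_correct(pair: ContrastivePair, score: Scorer) -> bool:
    """True only if each rendition strictly prefers its own translation."""
    a_ok = score(pair.audio_a, pair.translation_a) > score(pair.audio_a, pair.translation_b)
    b_ok = score(pair.audio_b, pair.translation_b) > score(pair.audio_b, pair.translation_a)
    return a_ok and b_ok


def contrastive_accuracy(pairs: List[ContrastivePair], score: Scorer) -> float:
    """Fraction of pairs on which the scorer separates both renditions."""
    if not pairs:
        return 0.0
    return sum(pair_is_correct(p, score) for p in pairs) / len(pairs)


# Toy scorers with made-up behavior, only to exercise the functions above.
def prosody_blind(audio: str, translation: str) -> float:
    # Ignores the audio entirely, so it cannot tell the renditions apart.
    return float(len(translation))


def prosody_aware(audio: str, translation: str) -> float:
    # Pretends to know which translation belongs to which rendition.
    return 1.0 if audio[-1] == translation[-1] else 0.0


pairs = [ContrastivePair("utt1_a", "utt1_b", "de_a", "de_b")]
blind_acc = contrastive_accuracy(pairs, prosody_blind)
aware_acc = contrastive_accuracy(pairs, prosody_aware)
```

Requiring both directions of a pair to be correct is a design choice in this sketch: a scorer that always favors one translation regardless of the audio then gets no credit.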

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems