Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?
Ioannis Tsiamas, Matthias Sperber, Andrew Finch, Sarthak Garg

TL;DR
This paper investigates whether speech-to-text translation systems make effective use of prosodic features such as intonation and rhythm, and introduces a new benchmark and methodology for evaluating prosody awareness in translation models.
Contribution
The authors propose a novel evaluation methodology and benchmark (ContraProST) for assessing prosody awareness in speech-to-text translation systems, and analyze how different models leverage prosodic information.
Findings
Speech-to-text translation (S2TT) models have some internal representation of prosody.
End-to-end systems outperform cascaded systems in prosody utilization.
Cascaded systems can also capture prosody, but less effectively, and only to a degree that depends on the surface form of the transcript.
Abstract
The prosody of a spoken utterance, including features like stress, intonation and rhythm, can significantly affect the underlying semantics, and as a consequence can also affect its textual translation. Nevertheless, prosody is rarely studied within the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well-suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but the understanding of whether this is successful in practice is still limited. A main challenge is the difficulty of evaluating prosody awareness in translation. To address this challenge, we introduce an evaluation methodology and a focused benchmark (named ContraProST) aimed at capturing a wide range of prosodic phenomena. Our methodology uses large language models and controllable…
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
