SpeechQE: Estimating the Quality of Direct Speech Translation
HyoJung Han, Kevin Duh, Marine Carpuat

TL;DR
This paper introduces SpeechQE, a new benchmark and system for estimating the quality of direct speech translation, highlighting the advantages of end-to-end models over cascaded approaches.
Contribution
The paper formulates the SpeechQE task, constructs a benchmark, and evaluates both cascaded and end-to-end systems, including a novel end-to-end system built on a pre-trained text LLM.
Findings
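Results suggest that end-to-end approaches are better suited to estimating the quality of direct speech translation than cascaded systems that reuse quality estimation models designed for text.
The novel end-to-end system leverages a pre-trained text LLM.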
The paper releases data and models to foster further research.
Abstract
Recent advances in automatic quality estimation for machine translation have exclusively focused on written language, leaving the speech modality underexplored. In this work, we formulate the task of quality estimation for speech translation (SpeechQE), construct a benchmark, and evaluate a family of systems based on cascaded and end-to-end architectures. In this process, we introduce a novel end-to-end system leveraging a pre-trained text LLM. Results suggest that end-to-end approaches are better suited to estimating the quality of direct speech translation than using quality estimation systems designed for text in cascaded systems. More broadly, we argue that quality estimation of speech translation needs to be studied as a separate problem from that of text, and release our data and models to guide further research in this space.
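To make the distinction between the two architecture families concrete, the following is a minimal sketch, not the authors' implementation. It assumes a cascaded system transcribes the audio first and then applies a text quality estimator, while an end-to-end system scores the translation directly from the audio. All names here (`cascaded_speech_qe`, `end_to_end_speech_qe`, and the `toy_*` stubs) are hypothetical placeholders, and the toy scorers stand in for real models.

```python
from typing import Callable, Sequence

Audio = Sequence[float]  # hypothetical stand-in for a waveform


def cascaded_speech_qe(
    audio: Audio,
    hypothesis: str,
    asr: Callable[[Audio], str],
    text_qe: Callable[[str, str], float],
) -> float:
    """Cascade: transcribe the speech, then score with a text QE model."""
    transcript = asr(audio)  # ASR errors propagate into the QE input
    return text_qe(transcript, hypothesis)


def end_to_end_speech_qe(
    audio: Audio,
    hypothesis: str,
    speech_qe: Callable[[Audio, str], float],
) -> float:
    """End-to-end: score the translation directly from the audio."""
    return speech_qe(audio, hypothesis)


# Toy stubs (assumptions for illustration only, not real models).
def toy_asr(audio: Audio) -> str:
    return "hello world"


def toy_text_qe(source: str, hypothesis: str) -> float:
    # Toy score: word-count ratio between source and hypothesis, in [0, 1].
    s, h = len(source.split()), len(hypothesis.split())
    return min(s, h) / max(s, h) if max(s, h) else 0.0


def toy_speech_qe(audio: Audio, hypothesis: str) -> float:
    # Toy score: 1.0 when both inputs are non-empty, else 0.0.
    return 1.0 if audio and hypothesis else 0.0


if __name__ == "__main__":
    audio = [0.1, 0.2, 0.3]
    print(cascaded_speech_qe(audio, "hallo welt", toy_asr, toy_text_qe))
    print(end_to_end_speech_qe(audio, "hallo welt", toy_speech_qe))
```

In the cascaded function the text QE model never sees the audio, so any transcription error changes what it scores; the end-to-end function has no intermediate transcript.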
Taxonomy
Topics: Natural Language Processing Techniques
