MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation
Dominik Macháček, Ondřej Bojar, Raj Dabre

TL;DR
This study analyzes how well offline MT evaluation metrics correlate with human ratings of simultaneous speech translation. It finds that they are reliable proxies at current quality levels, especially when the reference is a translation rather than interpreting.
Contribution
The paper provides an extensive correlation analysis between offline MT metrics and human ratings in SST, demonstrating their reliability and limitations for evaluation.
Findings
Offline metrics are well correlated with human ratings in SST.
Metrics correlate more strongly with human ratings when the reference is a translation than when it is interpreting.
Metrics can serve as proxies for human evaluation, reducing the need for large-scale human ratings.
Abstract
There have been several meta-evaluation studies on the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BertScore and COMET. These metrics have been used to evaluate simultaneous speech translation (SST), but their correlations with human ratings of SST, which have recently been collected as Continuous Ratings (CR), are unclear. In this paper, we leverage the evaluations of candidate systems submitted to the English-German SST task at IWSLT 2022 and conduct an extensive correlation analysis of CR and the aforementioned metrics. Our study reveals that the offline metrics are well correlated with CR and can be reliably used for evaluating machine translation in simultaneous mode, with some limitations on the test set size. We conclude that given the current quality levels of SST, these metrics can be used as proxies for CR,…
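As a rough illustration of the kind of correlation analysis the abstract describes, the sketch below computes a Pearson correlation between per-document metric scores and Continuous Ratings. It is not the authors' code: the choice of Pearson's coefficient, the function name `pearson`, and the score lists are all assumptions made for illustration, and the numbers are placeholders rather than values from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only; NOT taken from the paper.
# One entry per evaluated document: an automatic metric score (e.g. BLEU)
# and the mean Continuous Rating assigned by human raters.
metric_scores = [18.2, 21.5, 24.9, 27.3, 30.1]
continuous_ratings = [2.1, 2.6, 2.4, 3.3, 3.8]

r = pearson(metric_scores, continuous_ratings)
print(f"Pearson r = {r:.3f}")
```

A coefficient near 1 would indicate that the metric ranks and scales outputs much as the human raters do. With only a handful of documents the estimate is unstable, which is in line with the abstract's caveat about test set size.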
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification
Methods: Test
