A Critical Study of Automatic Evaluation in Sign Language Translation
Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith

TL;DR
This study critically evaluates the effectiveness of current text-based automatic metrics for sign language translation, highlighting their limitations and the potential of LLM-based evaluators for more accurate assessments.
Contribution
The paper analyzes six evaluation metrics, including LLM-based evaluators, under three controlled conditions (paraphrasing, hallucinations in model outputs, and variations in sentence length), revealing their strengths and biases in assessing SLT output quality.
Findings
Lexical overlap metrics are limited in capturing semantic quality.
LLM-based evaluators better detect semantic equivalence but show bias.
BLEU is overly sensitive to minor variations.
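The first and third findings can be illustrated with a toy example. The sketch below is not the paper's setup and not full BLEU: it computes only a clipped n-gram precision (the building block of BLEU, without the brevity penalty or the geometric mean), and the three sentences are hypothetical examples invented for illustration. It shows how a meaning-preserving paraphrase can score lower than an output that changes the meaning by one word.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n=1):
    """Clipped n-gram precision of hypothesis against one reference.

    Simplified illustration only: whitespace tokenization, lowercasing,
    a single reference, no brevity penalty.
    """
    hyp = Counter(ngrams(hypothesis.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Hypothetical sentences, not taken from the paper.
reference = "tomorrow it will rain in the north"
paraphrase = "rain is expected up north tomorrow"      # same meaning
hallucination = "tomorrow it will snow in the north"   # wrong weather

for name, hyp in [("paraphrase", paraphrase), ("hallucination", hallucination)]:
    p1 = clipped_precision(hyp, reference, n=1)
    p2 = clipped_precision(hyp, reference, n=2)
    print(f"{name:14s} unigram={p1:.3f} bigram={p2:.3f}")
```

Under these assumptions the paraphrase gets a unigram precision of 0.500 and a bigram precision of 0.000, while the output with the wrong weather gets 0.857 and 0.667. Overlap alone therefore ranks the semantically wrong output above the correct one.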
Abstract
Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are purely text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation by analyzing six metrics: BLEU, chrF, ROUGE, and BLEURT on the one hand, and the large language model (LLM)-based evaluators G-Eval and GEMBA zero-shot direct assessment on the other. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture…
