When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages
Archchana Sindhujan, Diptesh Kanojia, Constantin Orasan, Shenbin Qian

TL;DR
This paper explores reference-less machine translation quality estimation for low-resource languages, comparing large language models and fine-tuned models, and highlights the need for better cross-lingual pre-training.
Contribution
It introduces a novel prompt based on annotation guidelines for instruction fine-tuning, and provides a comprehensive evaluation of LLMs versus encoder-based fine-tuned models for low-resource language QE.
Findings
Encoder-based fine-tuned QE models outperform prompt-based LLM approaches.
Tokenization, transliteration, and named entity errors are major challenges.
Public release of data and models supports further research.
Abstract
This paper investigates the reference-less evaluation of machine translation for low-resource language pairs, known as quality estimation (QE). Segment-level QE is a challenging cross-lingual language understanding task that assigns a quality score (0-100) to the translated output. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios and perform instruction fine-tuning using a novel prompt based on annotation guidelines. Our results indicate that prompt-based approaches are outperformed by the encoder-based fine-tuned QE models. Our error analysis reveals tokenization issues, along with errors due to transliteration and named entities, and argues for refinement in LLM pre-training for cross-lingual tasks. We publicly release the data and trained models for further research.
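To make the task setup concrete, here is a minimal sketch of zero-shot, reference-less QE prompting: build a prompt from a source sentence and its machine translation, then parse a 0-100 score from the model's text reply. The prompt wording, the function names `build_qe_prompt` and `parse_score`, and the example sentences are all illustrative assumptions; the paper's actual prompt, derived from annotation guidelines, is not reproduced here, and the call to an LLM is left out.

```python
import re
from typing import Optional

# Hypothetical template (an assumption for illustration, not the paper's prompt).
PROMPT_TEMPLATE = (
    "Rate the quality of the following translation from {src_lang} to "
    "{tgt_lang} on a scale from 0 to 100. No reference translation is given.\n"
    "Source: {source}\n"
    "Translation: {translation}\n"
    "Score:"
)


def build_qe_prompt(source: str, translation: str,
                    src_lang: str, tgt_lang: str) -> str:
    """Fill the template with one source/translation segment pair."""
    return PROMPT_TEMPLATE.format(
        src_lang=src_lang, tgt_lang=tgt_lang,
        source=source, translation=translation,
    )


def parse_score(response: str) -> Optional[float]:
    """Return the first number in the reply if it lies in [0, 100], else None."""
    match = re.search(r"\d+(?:\.\d+)?", response)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 100.0 else None


if __name__ == "__main__":
    prompt = build_qe_prompt(
        source="Hello, how are you?",
        translation="<machine translation output>",  # placeholder text
        src_lang="English",
        tgt_lang="Tamil",  # illustrative language pair only
    )
    print(prompt)
    print(parse_score("Score: 85"))
```

Returning `None` for replies with no usable number keeps malformed model outputs visible to the caller instead of silently mapping them to a default score.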
Taxonomy
Topics: Natural Language Processing Techniques
