Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?
Shenbin Qian, Constantin Or\u{a}san, Diptesh Kanojia, F\'elix do Carmo

TL;DR
This study evaluates large language models as quality estimators for machine translation of user-generated content, finding PEFT improves performance but issues like refusals and instability remain.
Contribution
It demonstrates that parameter-efficient fine-tuning enhances LLMs' ability to estimate translation quality with human-interpretable explanations for UGC.
Findings
PEFT improves score prediction accuracy
LLMs provide human-interpretable explanations
Challenges include refusal to reply and output instability
Abstract
This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC) that contains emotional expressions, without the use of reference translations. To achieve this, we employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multi-dimensional Quality Metrics. We compare the accuracy of several LLMs with that of our fine-tuned baseline models, under in-context learning and parameter-efficient fine-tuning (PEFT) scenarios. We find that PEFT of LLMs leads to better performance in score prediction with human interpretable explanations than fine-tuned models. However, a manual analysis of LLM outputs reveals that they still have problems such as refusal to reply to a prompt and unstable output while evaluating machine translation of UGC.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
