Non-Linear Scoring Model for Translation Quality Evaluation
Serge Gladkoff, Lifeng Han, Katerina Gasova

TL;DR
This paper introduces a non-linear translation quality scoring model based on psychophysical principles, improving fairness and reliability in evaluating translations of varying lengths.
Contribution
It presents a calibrated, non-linear error model that better aligns with human perception and addresses biases in traditional linear scoring methods.
Findings
Error counts grow logarithmically with sample size
Model improves inter-rater reliability
Enhances interpretability and fairness in evaluation
Abstract
Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the…
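The contrast between linear and logarithmic error tolerance can be illustrated with a small sketch. This is not the paper's calibrated model: the functional form `a * ln(1 + b * words)`, the reference size of 1000 words, the tolerated count of 5 errors at that size, and the shape parameter `b` are all hypothetical values chosen only to show the direction of the bias the abstract describes.

```python
import math

# All constants below are illustrative assumptions, not values from the paper.
REFERENCE_WORDS = 1000    # assumed reference sample size (abstract cites 1000-2000 words)
REFERENCE_ERRORS = 5.0    # hypothetical tolerated error count at the reference size
SHAPE_B = 0.005           # hypothetical shape parameter of the logarithmic curve


def linear_tolerance(words: float) -> float:
    """Tolerated errors under linear extrapolation from the reference sample."""
    return REFERENCE_ERRORS * words / REFERENCE_WORDS


def log_tolerance(words: float, b: float = SHAPE_B) -> float:
    """Tolerated errors under an assumed logarithmic form a * ln(1 + b * words).

    The scale factor a is chosen so that both curves agree at the reference size.
    """
    a = REFERENCE_ERRORS / math.log1p(b * REFERENCE_WORDS)
    return a * math.log1p(b * words)


if __name__ == "__main__":
    # Compare the two tolerances for a short, a reference-sized, and a long sample.
    for words in (250, 1000, 5000):
        print(words, round(linear_tolerance(words), 2), round(log_tolerance(words), 2))
```

Under these assumptions the two curves coincide at the reference size. Below it, the linear rule tolerates fewer errors than the logarithmic one (over-penalizing short samples). Above it, the linear rule tolerates more (under-penalizing long samples).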
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Authorship Attribution and Profiling
