Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation
Shenbin Qian, Yves Scherrer

TL;DR
This paper investigates why large language models struggle with low-resource machine translation by analyzing token utilization and introducing the Token Activation Rate metric.
Contribution
It introduces the Token Activation Rate (TAR) as a novel metric to understand token utilization and its impact on translation performance in LLMs.
Findings
Lower TAR correlates with poorer translation quality.
Non-English-centric language pairs consistently yield lower COMET scores than English-centric pairs.
Reasoning LLMs generate more tokens for low-TAR languages, indicating a compensatory mechanism.
Abstract
Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is…
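The abstract describes TAR only informally, as a measure of how effectively a model uses the language-specific tokens in its vocabulary during generation. As a rough illustration, the sketch below assumes one plausible reading: the share of a language's vocabulary tokens that the model emits at least once across a set of generated translations. This reading, the function name `token_activation_rate`, and the toy token IDs are all assumptions made for illustration; the paper's exact definition may differ.

```python
from typing import Iterable, Set


def token_activation_rate(
    generated_ids: Iterable[Iterable[int]],
    language_token_ids: Set[int],
) -> float:
    """Share of language-specific vocabulary tokens emitted at least once.

    ASSUMPTION: this is an illustrative reading of TAR, not the paper's
    confirmed formula. `language_token_ids` is a hypothetical set of
    vocabulary IDs attributed to the target language.
    """
    if not language_token_ids:
        raise ValueError("language_token_ids must not be empty")

    activated: Set[int] = set()
    for sequence in generated_ids:
        # Keep only tokens that belong to the language-specific vocabulary.
        activated.update(t for t in sequence if t in language_token_ids)

    return len(activated) / len(language_token_ids)


# Toy example: four language-specific tokens, three of which are generated.
language_vocab = {10, 11, 12, 13}
outputs = [[1, 10, 11], [11, 2, 12]]
tar = token_activation_rate(outputs, language_vocab)
print(f"TAR = {tar:.2f}")
```

Under this reading, a language whose tokens are rarely produced would receive a low score, which fits the reported link between lower TAR and poorer translation quality.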
