Using Language Models to Disambiguate Lexical Choices in Translation
Josh Barua, Sanjay Subramanian, Kayo Yin, Alane Suhr

TL;DR
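This paper introduces DTAiLS, a dataset of 1,377 sentence pairs for lexical disambiguation when translating from English into nine languages, evaluates LLMs and neural machine translation systems on it, and shows that supplying lexical rules improves the accuracy of weaker models.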
Contribution
The paper contributes a new dataset for lexical selection, built with native speakers of nine languages, and demonstrates that lexical rules generated by language models can improve the lexical choices of weaker models.
Findings
GPT-4, the best-performing model, achieves 67 to 85% accuracy on DTAiLS, depending on the target language.
Weaker models given high-quality lexical rules improve substantially, in some cases matching or outperforming GPT-4.
Language models can generate useful lexical disambiguation rules.
Abstract
In translation, a concept represented by a single word in a source language can have multiple variations in a target language. The task of lexical selection requires using context to identify which variation is most appropriate for a source text. We work with native speakers of nine languages to create DTAiLS, a dataset of 1,377 sentence pairs that exhibit cross-lingual concept variation when translating from English. We evaluate recent LLMs and neural machine translation systems on DTAiLS, with the best-performing model, GPT-4, achieving from 67 to 85% accuracy across languages. Finally, we use language models to generate English rules describing target-language concept variations. Providing weaker models with high-quality lexical rules improves accuracy substantially, in some cases reaching or outperforming GPT-4.
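The evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `Example` fields, the prompt wording, the `pick_first` baseline, and the toy Japanese examples (English "rice" as uncooked "kome" versus cooked "gohan") are all assumptions made for illustration, and none of the data comes from DTAiLS. The sketch shows how lexical rules might be appended to a prompt and how accuracy could be computed per target language.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Example:
    """One lexical selection item (hypothetical schema, not the DTAiLS format)."""
    language: str          # target language
    source: str            # English source sentence
    concept: str           # ambiguous English word
    candidates: List[str]  # target-language variations of the concept
    gold: str              # variation judged correct for this sentence


def build_prompt(ex: Example, rules: Optional[Dict[str, str]] = None) -> str:
    """Format a prompt, optionally appending English rules for each variation."""
    lines = [
        f"Target language: {ex.language}",
        f"Sentence: {ex.source}",
        f"Which translation of '{ex.concept}' fits best? "
        f"Options: {', '.join(ex.candidates)}",
    ]
    if rules:
        lines.append("Rules:")
        lines.extend(f"- {word}: {rule}" for word, rule in rules.items())
    return "\n".join(lines)


def accuracy_by_language(
    examples: List[Example],
    predict: Callable[[str, List[str]], str],
    rules: Optional[Dict[str, str]] = None,
) -> Dict[str, float]:
    """Fraction of examples per language where the prediction equals the gold label."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        prediction = predict(build_prompt(ex, rules), ex.candidates)
        total[ex.language] += 1
        correct[ex.language] += int(prediction == ex.gold)
    return {lang: correct[lang] / total[lang] for lang in total}


def pick_first(prompt: str, candidates: List[str]) -> str:
    """Trivial stand-in for a model: always choose the first candidate."""
    return candidates[0]


# Toy data for illustration only.
toy_examples = [
    Example("Japanese", "I ate a bowl of rice for dinner.", "rice",
            ["kome", "gohan"], "gohan"),
    Example("Japanese", "The farmer sold a sack of rice.", "rice",
            ["kome", "gohan"], "kome"),
]
toy_rules = {
    "kome": "uncooked rice, as a grain or crop",
    "gohan": "cooked rice, served as food",
}

scores = accuracy_by_language(toy_examples, pick_first, toy_rules)
```

In practice `pick_first` would be replaced by a call to an actual model, and running the same examples with and without `rules` gives the kind of comparison the abstract describes.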
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Linear Layer · Multi-Head Attention · Residual Connection · Softmax · Byte Pair Encoding · Dropout · Absolute Position Encodings · Attention Is All You Need · Dense Connections · Label Smoothing
