Using Language Models to Disambiguate Lexical Choices in Translation

Josh Barua; Sanjay Subramanian; Kayo Yin; Alane Suhr

arXiv:2411.05781·cs.CL·November 11, 2024

TL;DR

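This paper introduces DTAiLS, a dataset of 1,377 sentence pairs for lexical selection when translating from English into nine languages, evaluates recent LLMs and neural machine translation systems on it, and shows that supplying high-quality lexical rules substantially improves the accuracy of weaker models.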

Contribution

The paper builds a new dataset for lexical selection with native speakers of nine languages, and demonstrates that language-model-generated English rules describing target-language concept variations can improve lexical selection accuracy.

Findings

01

GPT-4, the best-performing model, achieves 67 to 85% accuracy on DTAiLS, depending on the language.

02

Weaker models given high-quality lexical rules can, in some cases, reach or outperform GPT-4.

03

Language models can generate useful lexical disambiguation rules.

Abstract

In translation, a concept represented by a single word in a source language can have multiple variations in a target language. The task of lexical selection requires using context to identify which variation is most appropriate for a source text. We work with native speakers of nine languages to create DTAiLS, a dataset of 1,377 sentence pairs that exhibit cross-lingual concept variation when translating from English. We evaluate recent LLMs and neural machine translation systems on DTAiLS, with the best-performing model, GPT-4, achieving from 67 to 85% accuracy across languages. Finally, we use language models to generate English rules describing target-language concept variations. Providing weaker models with high-quality lexical rules improves accuracy substantially, in some cases reaching or outperforming GPT-4.
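
The evaluation setup described in the abstract (choose one target-language variation per source sentence, optionally with English rules in the prompt, then report accuracy per language) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the record layout, the prompt wording, the function names `build_prompt` and `accuracy_by_language`, and the two toy Japanese examples are all hypothetical and are not taken from DTAiLS or the berkeley-nlp/lex-rules repository. The `choose` callable stands in for whatever model is being evaluated.

```python
from collections import defaultdict

# Hypothetical record layout (NOT the DTAiLS schema):
#   lang, source (English sentence), concept, variants, gold


def build_prompt(example, rules=None):
    """Format one lexical-selection query; rules are optional English hints."""
    lines = [
        f"English sentence: {example['source']}",
        f"Concept: {example['concept']}",
        "Candidate translations: " + ", ".join(example["variants"]),
    ]
    if rules:
        lines.append("Rules:")
        lines.extend(f"- {variant}: {rule}" for variant, rule in rules.items())
    lines.append("Answer with exactly one candidate.")
    return "\n".join(lines)


def accuracy_by_language(examples, choose, rules_by_concept=None):
    """Score `choose(prompt, variants)` against gold labels, per language."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        # Look up rules for this (language, concept) pair, if any were given.
        rules = (rules_by_concept or {}).get((ex["lang"], ex["concept"]))
        prediction = choose(build_prompt(ex, rules), ex["variants"])
        total[ex["lang"]] += 1
        correct[ex["lang"]] += int(prediction == ex["gold"])
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy, hand-written examples for illustration only.
examples = [
    {"lang": "ja", "source": "She washed the rice before cooking it.",
     "concept": "rice", "variants": ["kome", "gohan"], "gold": "kome"},
    {"lang": "ja", "source": "He ate a bowl of rice with his fish.",
     "concept": "rice", "variants": ["kome", "gohan"], "gold": "gohan"},
]


def first_variant(prompt, variants):
    """Trivial baseline standing in for a model: always pick the first option."""
    return variants[0]


scores = accuracy_by_language(examples, first_variant)
```

Replacing `first_variant` with a function that queries a model, and passing a `rules_by_concept` mapping such as `{("ja", "rice"): {"kome": "uncooked rice", "gohan": "cooked rice"}}`, would give the with-rules and without-rules comparison in this simplified form.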

Code & Models

Repositories

berkeley-nlp/lex-rules (PyTorch, official)

Videos

Using Language Models to Disambiguate Lexical Choices in Translation (Underline)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Linear Layer · Multi-Head Attention · Residual Connection · Softmax · Byte Pair Encoding · Dropout · Absolute Position Encodings · Attention Is All You Need · Dense Connections · Label Smoothing