LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation
Yongjing Yin, Jiali Zeng, Yafu Li, Fandong Meng, Yue Zhang

TL;DR
LexMatcher is a dictionary-driven data collection method that improves LLM-based machine translation by enhancing sense coverage and handling polysemy, leading to better performance on standard benchmarks.
Contribution
This paper introduces LexMatcher, a novel data curation approach based on bilingual dictionaries, specifically designed for instruction fine-tuning of LLMs in machine translation.
Findings
Outperforms baselines on WMT2022 test sets
Improves word sense disambiguation and terminology translation
Effective in handling infrequent senses of polysemous words
Abstract
The fine-tuning of open-source large language models (LLMs) for machine translation has recently received considerable attention, marking a shift towards data-centric research from traditional neural machine translation. However, the area of data collection for instruction fine-tuning in machine translation remains relatively underexplored. In this paper, we present LexMatcher, a simple yet effective method for data curation, the design of which is driven by the coverage of senses found in bilingual dictionaries. The construction process comprises data retrieval from an existing corpus and data augmentation that supplements the infrequent senses of polysemous words. Utilizing LLaMA2 as our base model, our approach outperforms the established baselines on the WMT2022 test sets and also exhibits remarkable performance in tasks related to word sense disambiguation and specialized…
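The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the exact-match lookup, the toy dictionary, and the `max_per_sense` cap are all assumptions made for illustration. The sketch keeps a sentence pair only if it covers some dictionary sense (a source word paired with one of its translations) that has not yet been seen `max_per_sense` times, and then lists the senses that the corpus never covered, which are the ones an augmentation step would need to supply.

```python
from collections import defaultdict


def select_by_sense_coverage(pairs, dictionary, max_per_sense=2):
    """Greedy, sense-driven selection of parallel sentence pairs.

    pairs: iterable of (source_sentence, target_sentence) strings.
    dictionary: maps a source word to a list of target translations;
        each (word, translation) pair is treated as one "sense".
    max_per_sense: hypothetical cap on how often a sense must be seen
        before it stops justifying the selection of new pairs.
    """
    counts = defaultdict(int)
    selected = []
    for src, tgt in pairs:
        # Assumption: naive lowercase whitespace tokenization.
        src_tokens = set(src.lower().split())
        tgt_tokens = set(tgt.lower().split())
        # A sense is matched when the word is in the source sentence
        # and one of its dictionary translations is in the target.
        matched = [
            (word, trans)
            for word in src_tokens
            for trans in dictionary.get(word, ())
            if trans in tgt_tokens
        ]
        # Keep the pair only if it adds coverage for an under-seen sense.
        if any(counts[sense] < max_per_sense for sense in matched):
            selected.append((src, tgt))
            for sense in matched:
                counts[sense] += 1
    return selected, dict(counts)


def uncovered_senses(dictionary, counts):
    """Senses never matched in the corpus (candidates for augmentation)."""
    return sorted(
        (word, trans)
        for word, translations in dictionary.items()
        for trans in translations
        if counts.get((word, trans), 0) == 0
    )


# Toy English-German data, invented for this example.
dictionary = {
    "bank": ["ufer", "bank"],
    "spring": ["frühling", "feder", "quelle"],
}
pairs = [
    ("the bank is closed", "die bank ist geschlossen"),
    ("the bank was open", "die bank war offen"),
    ("the bank is far", "die bank ist weit"),
    ("we sat on the bank", "wir sitzen am ufer"),
    ("spring is here", "der frühling ist da"),
]

selected, counts = select_by_sense_coverage(pairs, dictionary, max_per_sense=2)
missing = uncovered_senses(dictionary, counts)
```

With this toy data the third "bank" sentence is skipped because the financial sense has already reached the cap, while the riverbank sentence is kept because it covers a sense not seen before. The two unmatched senses of "spring" end up in `missing`.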
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Translation Studies and Practices
Methods: Balanced Selection
