Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus
Gokhan Dogru

TL;DR
This paper presents a semi-automatic methodology for creating domain-specific translation memories, demonstrated by building a Turkish-English cardiology corpus for machine translation training and fine-tuning.
Contribution
It introduces a semi-automatic approach leveraging translation tools for efficient creation of high-quality domain-specific translation memories.
Findings
Built the TRENCARD corpus with approximately 800,000 source words and 50,000 sentences
Method enables quick creation of custom translation memories
Corpus improves domain-specific machine translation performance
Abstract
This article investigates how translation memories (TMs) can be created by translators or other language professionals in order to compile domain-specific parallel corpora, which can then be used in different scenarios, such as machine translation training and fine-tuning, TM leveraging, and/or large language model fine-tuning. The article introduces a semi-automatic TM preparation methodology that relies primarily on the translation tools translators already use, favoring data quality and translator control. This semi-automatic methodology is then used to build a Turkish-to-English cardiology corpus from bilingual abstracts of Turkish cardiology journals. The resulting corpus, called the TRENCARD Corpus, has approximately 800,000 source words and 50,000 sentences. Using this methodology, translators can build their custom TMs in a reasonable time and use them in their bilingual data…
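As a rough illustration of what a translation memory looks like on disk, the sketch below writes already-aligned sentence pairs to a minimal TMX 1.4 document using only the Python standard library. It is not the paper's pipeline, which relies on translation tools. The function name `build_tmx`, the header values, the simple deduplication step, and the two sample sentence pairs are all assumptions made for this example and are not taken from the TRENCARD corpus.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the standard "xml:lang" attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="tr", tgt_lang="en"):
    """Build a minimal TMX 1.4 tree from (source, target) sentence pairs.

    Hypothetical helper: skips empty segments and exact duplicate pairs.
    """
    tmx = ET.Element("tmx", version="1.4")
    # Header values are placeholders chosen for this sketch.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt or (src, tgt) in seen:
            continue  # drop empty or repeated translation units
        seen.add((src, tgt))
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


# Invented sample pairs, for illustration only.
sample_pairs = [
    ("Hasta taburcu edildi.", "The patient was discharged."),
    ("Hasta taburcu edildi.", "The patient was discharged."),  # duplicate
    ("Komplikasyon görülmedi.", "No complications were observed."),
]
tmx_root = build_tmx(sample_pairs)
tmx_text = ET.tostring(tmx_root, encoding="unicode")
```

To save the result, `ET.ElementTree(tmx_root).write("sample.tmx", encoding="utf-8", xml_declaration=True)` writes a file that TMX-aware tools should be able to open; the file name here is arbitrary.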
Taxonomy
Topics: Natural Language Processing Techniques
