Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus

Gokhan Dogru

arXiv:2409.02667·cs.CL·September 5, 2024

Open Access

TL;DR

This paper presents a semi-automatic methodology for creating domain-specific translation memories and demonstrates it by building a Turkish-English cardiology corpus intended for machine translation training and fine-tuning.

Contribution

It introduces a semi-automatic approach leveraging translation tools for efficient creation of high-quality domain-specific translation memories.

Findings

01

Built the TRENCARD corpus, with approximately 800,000 source words and 50,000 sentences

02

Method enables quick creation of custom translation memories

03

Corpus improves domain-specific machine translation performance

Abstract

This article investigates how translation memories (TMs) can be created by translators or other language professionals in order to compile domain-specific parallel corpora, which can then be used in different scenarios, such as machine translation training and fine-tuning, TM leveraging, and/or large language model fine-tuning. The article introduces a semi-automatic TM preparation methodology that relies primarily on the translation tools translators already use, favoring data quality and control by the translators. This semi-automatic methodology is then used to build a cardiology-based Turkish -> English corpus from bilingual abstracts of Turkish cardiology journals. The resulting corpus, called the TRENCARD Corpus, has approximately 800,000 source words and 50,000 sentences. Using this methodology, translators can build their custom TMs in a reasonable time and use them in their bilingual data…
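To make the idea of a translation memory concrete, the following is a minimal sketch of how aligned sentence pairs could be stored as TMX (the common TM exchange format) and read back as plain source/target pairs for fine-tuning. It is not the paper's pipeline, which relies on translation tools. The function names `build_tmx` and `read_pairs`, the tool name in the header, and the example sentence pair are all assumptions made for illustration, and the sentence pair is invented, not taken from TRENCARD.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="tr", tgt_lang="en"):
    """Build a TMX 1.4 tree from (source, target) sentence pairs (illustrative helper)."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # skip pairs with an empty side
        tu = ET.SubElement(body, "tu")  # one translation unit per sentence pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)


def read_pairs(tree, src_lang="tr", tgt_lang="en"):
    """Extract (source, target) pairs from a TMX tree (illustrative helper)."""
    pairs = []
    for tu in tree.getroot().iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Invented example pair; the second pair has an empty source and is dropped.
example_pairs = [
    ("Hastalar iki gruba ayrıldı.", "The patients were divided into two groups."),
    ("   ", "This pair has no source text."),
]
tree = build_tmx(example_pairs)
recovered = read_pairs(tree)
```

The tree can be saved with `tree.write("corpus.tmx", encoding="utf-8", xml_declaration=True)`, where the file name is a placeholder. Keeping one translation unit per sentence pair mirrors the sentence-level segmentation that the corpus size figures above imply.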

Taxonomy

Topics: Natural Language Processing Techniques