Enhancing Scientific Discourse: Machine Translation for the Scientific Domain
Dimitris Roussis, Sokratis Sofianopoulos, Stelios Piperidis

TL;DR
This paper develops specialized scientific corpora for Spanish-English, French-English, and Portuguese-English to improve machine translation of scientific texts, where specialized vocabulary and complex sentence structures pose domain-specific challenges.
Contribution
It introduces a new collection of domain-specific parallel and monolingual corpora for scientific translation and evaluates their effectiveness in fine-tuning neural machine translation systems.
Findings
The new corpora improve translation quality on scientific text.
Fine-tuning general-purpose NMT systems with domain-specific data enhances their performance.
Evaluation results demonstrate the value of specialized scientific corpora.
Abstract
The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora for the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the domains of Cancer Research, Energy Research, Neuroscience, and Transportation Research. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details…
