A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain
Jorge del Pozo Lérida, Kamil Kojs, János Máté, Mikołaj Antoni Barański, Christian Hardmeier

TL;DR
This study compares data filtering techniques like LASER, MUSE, and LaBSE for English-Polish biomedical machine translation, showing LASER's superior ability to reduce data size while maintaining or improving translation quality.
Contribution
It provides an empirical evaluation of filtering methods specifically for English-Polish biomedical translation, highlighting LASER's effectiveness over alternatives.
Findings
LASER and MUSE significantly reduce dataset sizes.
LASER outperforms other methods in translation quality.
Models fine-tuned on LASER-filtered data produce the most fluent translations.
Abstract
Large Language Models (LLMs) have become state-of-the-art in Machine Translation (MT), often trained on massive bilingual parallel corpora scraped from the web, that contain low-quality entries and redundant information, leading to significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness largely varies based on specific language pairs and domains. This paper evaluates the impact of commonly used data filtering techniques, such as LASER, MUSE, and LaBSE, on English-Polish translation within the biomedical domain. By filtering the UFAL Medical Corpus, we created varying dataset sizes to fine-tune the mBART50 model, which was then evaluated using the SacreBLEU metric on the Khresmoi dataset, having the quality of translations assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce…
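The abstract names the filtering tools but this excerpt does not spell out the procedure. As a rough illustration only, embedding-based filtering is commonly done by scoring each sentence pair with the cosine similarity of its source and target embeddings and keeping the highest-scoring fraction. The sketch below assumes that scheme; the function names, the `keep_fraction` parameter, and the toy two-dimensional vectors (stand-ins for real LASER, MUSE, or LaBSE embeddings) are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_similarity(pairs, src_vecs, tgt_vecs, keep_fraction):
    """Keep the top `keep_fraction` of sentence pairs ranked by embedding similarity.

    Assumption: src_vecs[i] and tgt_vecs[i] are precomputed multilingual
    embeddings of pairs[i]; how they are produced is outside this sketch.
    """
    scored = [(cosine(s, t), pair) for pair, s, t in zip(pairs, src_vecs, tgt_vecs)]
    # Sort on the score only, highest first.
    scored.sort(key=lambda item: item[0], reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return [pair for _, pair in scored[:keep]]


# Toy English-Polish pairs; the third one is deliberately misaligned.
pairs = [
    ("heart attack", "zawał serca"),
    ("blood pressure", "ciśnienie krwi"),
    ("click here", "zawał serca"),
    ("liver", "wątroba"),
]
# Hypothetical 2-d embeddings, chosen so the misaligned pair scores lowest.
src_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
tgt_vecs = [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0], [0.8, 0.6]]

kept = filter_by_similarity(pairs, src_vecs, tgt_vecs, keep_fraction=0.5)
```

Varying `keep_fraction` is one way to obtain the differently sized training subsets the abstract mentions; a fixed similarity threshold would be an equally plausible alternative.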
Taxonomy
Topics: Natural Language Processing Techniques · Biomedical Text Mining and Ontologies · Linguistics and terminology studies
