MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
Lifeng Han, Gareth J.F. Jones, Alan F. Smeaton

TL;DR
This paper introduces large-scale multilingual MWE corpora derived from parallel texts, demonstrating their utility in improving machine translation quality for German-English and Chinese-English pairs.
Contribution
It presents new, extensive bilingual MWE corpora extracted from parallel corpora, expanding resources available for NLP research and applications.
Findings
Improved translation quality for MWEs in MT experiments.
Large-scale bilingual MWE corpora available for research.
Enhanced evaluation scores using MWEs in MT models.
Abstract
Multi-word expressions (MWEs) are a hot topic in research in natural language processing (NLP), including topics such as MWE detection, MWE decomposition, and research investigating the exploitation of MWEs in other NLP fields such as Machine Translation. However, the availability of bilingual or multi-lingual MWE corpora is very limited. The only bilingual MWE corpora that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU Project. This is a small collection of only 871 pairs of English-German MWEs. In this paper, we present multi-lingual and bilingual MWE corpora that we have extracted from root parallel corpora. Our collections are 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively after filtering. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
