A Rule-Based Approach For Aligning Japanese-Spanish Sentences From A Comparable Corpora
Jessica C. Ramírez, Yuji Matsumoto

TL;DR
This paper presents a rule-based method leveraging syntactic features and POS tagging to extract Japanese-Spanish parallel sentences from Wikipedia, aiming to build a parallel corpus for SMT.
Contribution
It introduces a novel rule-based approach focused on syntactic features for extracting Japanese-Spanish sentence pairs from comparable corpora.
Findings
Human evaluation shows promising results
Outperforms baseline methods
Effective extraction of parallel sentences
Abstract
The performance of a Statistical Machine Translation (SMT) system is directly proportional to the quality and size of the parallel corpus it uses. However, for some language pairs such corpora are severely lacking. Our long-term goal is to construct a Japanese-Spanish parallel corpus for SMT, since useful Japanese-Spanish parallel corpora are scarce. To address this problem, in this study we propose a method for extracting Japanese-Spanish parallel sentences from Wikipedia using POS tagging and a rule-based approach. The approach focuses mainly on the syntactic features of both languages. Human evaluation was performed on a sample and shows promising results in comparison with the baseline.
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
