A Rule-Based Approach For Aligning Japanese-Spanish Sentences From A Comparable Corpora

Jessica C. Ramírez; Yuji Matsumoto

arXiv:1211.4488·cs.CL·November 20, 2012

Open Access

TL;DR

This paper presents a rule-based method leveraging syntactic features and POS tagging to extract Japanese-Spanish parallel sentences from Wikipedia, aiming to build a parallel corpus for SMT.

Contribution

It introduces a novel rule-based approach focused on syntactic features for extracting Japanese-Spanish sentence pairs from comparable corpora.

Findings

1. Human evaluation shows promising results
2. Outperforms baseline methods
3. Effective extraction of parallel sentences

Abstract

The performance of a statistical machine translation (SMT) system depends directly on the quality and size of the parallel corpus it uses. For some language pairs, however, such corpora are scarce. Our long-term goal is to construct a Japanese-Spanish parallel corpus for SMT, since few useful Japanese-Spanish parallel corpora exist. To address this problem, we propose a method for extracting Japanese-Spanish parallel sentences from Wikipedia using POS tagging and a rule-based approach. The approach focuses on the syntactic features of both languages. Human evaluation on a sample shows promising results in comparison with the baseline.
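
The abstract does not spell out the rules themselves, so the following is only a toy illustration of the general idea of pairing sentences by comparing their POS-tag content, not the authors' method. It assumes sentences arrive already tagged as (token, tag) pairs with a shared coarse tag set; the tag names, the `CONTENT_TAGS` set, the similarity measure, and the `threshold` value are all hypothetical choices made for this sketch.

```python
from collections import Counter

# Hypothetical coarse tag set shared by both languages. A real pipeline
# would map the output of language-specific POS taggers onto such a set.
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "NUM"}


def content_profile(tagged_sentence):
    """Count content-word POS tags in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in tagged_sentence if tag in CONTENT_TAGS)


def profile_similarity(profile_a, profile_b):
    """Weighted Jaccard overlap of two tag-count profiles, in [0, 1]."""
    shared = sum((profile_a & profile_b).values())  # per-tag minimum
    total = sum((profile_a | profile_b).values())   # per-tag maximum
    return shared / total if total else 0.0


def extract_candidate_pairs(ja_sentences, es_sentences, threshold=0.8):
    """Pair each Japanese sentence with its best-scoring Spanish sentence.

    Returns (ja_index, es_index, score) tuples for pairs whose score
    reaches the (assumed) threshold.
    """
    pairs = []
    for i, ja in enumerate(ja_sentences):
        ja_profile = content_profile(ja)
        best_j, best_score = None, 0.0
        for j, es in enumerate(es_sentences):
            score = profile_similarity(ja_profile, content_profile(es))
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs
```

A tag-count comparison like this ignores word order and lexical meaning, so in practice it would serve at most as a cheap filter ahead of stronger rules or a bilingual dictionary.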

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling