A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining
Masaaki Nagata, Makoto Morishita, Katsuki Chousa, Norihito Yasuda

TL;DR
This paper presents a Japanese-Chinese parallel corpus of 4.6M sentence pairs, mined from bilingual websites whose URLs were collected through crowdsourcing. A translation model trained on it reaches accuracy comparable to one trained on the 12.4M-pair CCMatrix corpus, which is roughly three times larger.
Contribution
It introduces a crowdsourcing approach to collecting URL pairs of bilingual websites that contain parallel documents, and shows that this approach is effective for building a Japanese-Chinese corpus for machine translation.
Findings
The corpus contains 4.6 million sentence pairs.
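A model trained on this corpus achieves translation accuracy comparable to a model trained on CCMatrix (12.4M sentence pairs), although the corpus is only about one-third the size.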
Crowdsourcing is feasible for web mining of parallel data.
Abstract
Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used high-quality 1.2M Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing…
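To make the filtering idea in the abstract more concrete, here is a minimal sketch of scoring a sentence pair with word translation probabilities, in the style of IBM Model 1. It is not the authors' implementation: the table `TOY_T`, the `FLOOR` value, the threshold, and the function names are all hypothetical, and the language-model component mentioned in the abstract is omitted. In practice the probabilities would be estimated from a clean seed corpus such as the 1.2M pairs mentioned above.

```python
import math

# Hypothetical toy table of P(target_word | source_word).
# Real values would be estimated from a high-quality seed corpus.
TOY_T = {
    ("私", "我"): 0.9,
    ("猫", "猫"): 0.9,
    ("好き", "喜欢"): 0.8,
}

FLOOR = 1e-6  # assumed probability for word pairs missing from the table


def model1_log_score(src_tokens, tgt_tokens, table=TOY_T, floor=FLOOR):
    """Length-normalised, IBM Model 1 style log-probability of tgt given src."""
    if not src_tokens or not tgt_tokens:
        return float("-inf")
    total = 0.0
    for t in tgt_tokens:
        # Average the translation probability over all source words.
        p = sum(table.get((s, t), floor) for s in src_tokens) / len(src_tokens)
        total += math.log(p)
    return total / len(tgt_tokens)


def keep_pair(src_tokens, tgt_tokens, threshold=-5.0):
    """Keep a pair only if its score reaches the (assumed) threshold."""
    return model1_log_score(src_tokens, tgt_tokens) >= threshold


# A plausible translation pair and an unrelated pair, both pre-tokenised.
good = (["私", "猫", "好き"], ["我", "喜欢", "猫"])
bad = (["私", "猫", "好き"], ["今天", "下雨"])
print(keep_pair(*good), keep_pair(*bad))
```

A real filter would typically score both translation directions and combine the result with language-model scores for each side; the single-direction score here only illustrates the word-translation part.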
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Advanced Text Analysis Techniques · Spam and Phishing Detection
