Generating Multilingual Parallel Corpus Using Subtitles
Farshad Jafari

TL;DR
This paper presents a method to automatically generate large-scale multilingual parallel corpora from open-source video subtitles, addressing the scarcity of such data for many language pairs in neural machine translation.
Contribution
It introduces an automated process to extract synchronized sentence pairs from online subtitles, enabling the creation of context-rich and informal style corpora for diverse languages.
Findings
Successfully extracted parallel corpora from subtitles for multiple languages.
Enabled the creation of informal language datasets for translation models.
Improved translation quality by incorporating context and informal speech styles.
Abstract
Neural Machine Translation, despite its significant results, still has a major problem: the lack or absence of parallel corpora for many languages. This article suggests a method for generating a considerable amount of parallel corpus data for any language pair, extracted from open-source materials available on the Internet. The parallel corpus content is derived from video subtitles. The method needs a set of video titles with attributes such as release date, rating, and duration. A crawler automates finding and downloading subtitle pairs for the desired language pairs. Finally, sentence pairs are extracted from synchronous dialogues in the subtitles. The main problem with this method is unsynchronized subtitle pairs; therefore, subtitles are verified before downloading. If two subtitles are not synchronized, another subtitle of that video is processed till it finds the…
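The extraction and verification steps described in the abstract can be illustrated with a minimal sketch. It assumes subtitles in SubRip (.srt) format with cues sorted by time, pairs cues by timestamp overlap, and treats two files as synchronized when most cues find a partner. The function names (`parse_srt`, `align`, `is_synchronized`) and the thresholds (`min_ratio`, `threshold`) are hypothetical choices made for illustration; the abstract does not specify the paper's actual alignment or verification procedure.

```python
import re
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)

# SubRip timing line, e.g. "00:00:01,000 --> 00:00:03,000"
_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(text: str) -> List[Cue]:
    """Parse SubRip text into (start, end, text) cues."""
    cues: List[Cue] = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            m = _TIME.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                body = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if body:
                    cues.append((start, end, body))
                break
    return cues


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time interval shared by two cues."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def align(src: List[Cue], tgt: List[Cue], min_ratio: float = 0.5) -> List[Tuple[str, str]]:
    """Greedy one-to-one pairing of cues that overlap in time.

    min_ratio is an assumed threshold: the shared interval must cover at
    least this fraction of the shorter cue.
    """
    pairs: List[Tuple[str, str]] = []
    j = 0
    for s in src:
        # Skip target cues that end before this source cue starts.
        while j < len(tgt) and tgt[j][1] <= s[0]:
            j += 1
        if j == len(tgt):
            break
        t = tgt[j]
        shorter = min(s[1] - s[0], t[1] - t[0])
        if shorter > 0 and overlap(s, t) / shorter >= min_ratio:
            pairs.append((s[2], t[2]))
            j += 1
    return pairs


def is_synchronized(src: List[Cue], tgt: List[Cue], threshold: float = 0.8) -> bool:
    """Assumed check: enough cues of the shorter file found a partner."""
    if not src or not tgt:
        return False
    return len(align(src, tgt)) / min(len(src), len(tgt)) >= threshold
```

In this sketch, a subtitle pair that fails `is_synchronized` would be discarded and another subtitle file for the same video tried, mirroring the retry loop the abstract describes. Real subtitle files often split or merge sentences across cues differently in each language, so a one-to-one pairing like this is only a starting point.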
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Topic Modeling
