Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC
Chanjun Park, Midan Shim, Sugyeong Eo, Seolhwa Lee, Jaehyung Seo, Hyeonseok Moon, Heuiseok Lim

TL;DR
This paper evaluates Korean public parallel corpora for neural machine translation using LIWC, revealing linguistic features that correlate with translation quality and guiding future improvements in corpus quality.
Contribution
First to apply LIWC analysis to Korean parallel corpora, providing insights into linguistic features affecting NMT performance and suggesting directions for corpus enhancement.
Findings
LIWC effectively characterizes linguistic features of Korean parallel corpora.
Correlations identified between LIWC features and NMT translation quality.
Results guide future efforts to improve corpus quality for Korean NMT.
Abstract
A machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora for Korean are relatively scarce compared to those for other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features on a dictionary basis. To the best of our knowledge,…
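The dictionary-based word counting that the abstract attributes to LIWC can be illustrated with a minimal Python sketch. This is not the LIWC software or its lexicon: the category names, the word lists in `TOY_DICTIONARY`, and the function `category_percentages` are all hypothetical and chosen only for illustration. The sketch assumes each category score is reported as a percentage of total tokens, and it uses a simple English tokenizer rather than anything suited to Korean.

```python
import re
from collections import Counter

# Hypothetical toy dictionary (category -> word set); NOT the real LIWC lexicon.
TOY_DICTIONARY = {
    "pronoun": {"i", "you", "we", "they"},
    "negation": {"no", "not", "never"},
    "posemo": {"good", "happy", "great"},
}


def category_percentages(text, dictionary=TOY_DICTIONARY):
    """Return, per category, the percentage of tokens found in that category."""
    # Naive lowercase tokenizer: assumes space-separated English text.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    total = len(tokens)
    if total == 0:
        # Avoid division by zero on empty input.
        return {category: 0.0 for category in dictionary}
    return {category: 100.0 * counts[category] / total for category in dictionary}


if __name__ == "__main__":
    print(category_percentages("We are not happy, but they are great."))
```

Applying the same function to each side of a parallel corpus would give one feature vector per side, which is the kind of per-corpus profile the summary describes comparing against translation quality.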
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies
