A Study on the Appropriate size of the Mongolian general corpus

Sunsoo Choi; Ganbat Tsend

arXiv:2307.06050 · cs.CL · July 13, 2023

Open Access

TL;DR

This paper estimates an appropriate size for a Mongolian general corpus using the Heaps function and the type-token ratio (TTR), concluding that 39 to 42 million tokens are sufficient because TTR hardly changes beyond that range.

Contribution

It introduces a method for estimating an appropriate corpus size for Mongolian: fit the Heaps function to a 906,064-token sample corpus, then track the projected number of types and the TTR as the token count grows.

Findings

01

TTR hardly changes once the corpus exceeds 39 to 42 million tokens

02

The Heaps function, fitted to the sample corpus, projects how the number of types grows with the number of tokens

03

Optimal corpus size is 39-42 million tokens

Abstract

This study uses the Heaps function and the type-token ratio (TTR) to determine an appropriate size for a Mongolian general corpus. The sample corpus of 906,064 tokens comprised texts from 10 domains: newspaper articles on politics, economy, society, culture, sports, and world news; laws; middle and high school literature textbooks; interview articles; and podcast transcripts. First, we estimated the Heaps function from this sample corpus. Next, using the estimated Heaps function, we observed how the number of types and the TTR changed as the number of tokens increased in steps of one million. We found that the TTR hardly changed once the number of tokens exceeded 39 to 42 million. Thus, we conclude that an appropriate size for a Mongolian general corpus is 39 to 42 million tokens.
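
The procedure in the abstract can be illustrated with a minimal Python sketch. It assumes the common power-law form of the Heaps function, V(N) = K · N^β (V types, N tokens), fitted by least squares in log-log space; the abstract does not state the exact form or fitting method the authors used. The function names, the synthetic demo values (K = 10, β = 0.6), and the stability tolerance are all hypothetical and are not taken from the paper.

```python
import math


def fit_heaps(tokens_seen, types_seen):
    """Fit V = K * N**beta by ordinary least squares on log V = log K + beta * log N."""
    xs = [math.log(n) for n in tokens_seen]
    ys = [math.log(v) for v in types_seen]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


def extrapolate_ttr(k, beta, step=1_000_000, max_tokens=50_000_000):
    """Project (tokens, types, TTR) at every `step` tokens using the fitted curve."""
    rows = []
    for n in range(step, max_tokens + 1, step):
        types = k * n ** beta
        rows.append((n, types, types / n))  # TTR = types / tokens
    return rows


def first_stable(rows, tol):
    """Return the first token count where TTR drops by less than `tol` in one step.

    `tol` is a hypothetical threshold; the abstract only says the TTR
    "hardly changed" and gives no numeric criterion.
    """
    for (_, _, prev_ttr), (n, _, ttr) in zip(rows, rows[1:]):
        if prev_ttr - ttr < tol:
            return n
    return None


# Synthetic demo: (tokens, types) points generated from made-up parameters,
# NOT measurements from the Mongolian sample corpus.
demo_tokens = [1_000, 10_000, 100_000, 900_000]
demo_types = [10 * n ** 0.6 for n in demo_tokens]
k_hat, beta_hat = fit_heaps(demo_tokens, demo_types)
projection = extrapolate_ttr(k_hat, beta_hat)
```

In practice, the (tokens, types) pairs passed to `fit_heaps` would be measured by reading the sample corpus in order and recording the vocabulary size at regular intervals. Under this power-law form the TTR always decreases with N, so any stopping point depends on the chosen tolerance.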

Taxonomy

Topics: Natural Language Processing Techniques