Generating abbreviations using Google Books library

Valery D. Solovyev; Vladimir V. Bochkarev

arXiv:1410.1080·cs.CL·October 7, 2014

TL;DR

This paper presents a method for generating abbreviation dictionaries from the Google Books Ngram Corpus. The dictionary is built for Russian, but the method is language-independent. During text segmentation, the dictionary helps decide what role a period plays.

Contribution

It introduces an approach to building abbreviation dictionaries from a large corpus, describes the difficulties encountered during construction and how they were overcome, and proposes a model for estimating the probability of type I and type II errors.

Findings

01. Developed a Russian abbreviation dictionary from Google Books data

02. Identified key difficulties and solutions in dictionary construction

03. Provided statistical insights into abbreviation usage

Abstract

The article describes an original method of creating a dictionary of abbreviations based on the Google Books Ngram Corpus. The dictionary is designed for Russian, but because the methodology is universal it can be applied to any language. The dictionary can be used to determine the function of the period during text segmentation in various applied text-processing systems. The article describes the difficulties encountered during its construction and the ways to overcome them. A model for estimating the probability of type I and type II errors (extraction accuracy and completeness) is constructed. Some statistics on the use of abbreviations are provided.
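To make the segmentation use case concrete, here is a minimal sketch of how an abbreviation dictionary might be used to decide whether a period ends a sentence. It is not the authors' implementation: the function name `split_sentences`, the tiny `ABBREVIATIONS` set, and the whitespace tokenization are all assumptions made for illustration. A real system would load the full generated dictionary.

```python
# Hypothetical mini-dictionary (an assumption for this sketch); a real system
# would load the dictionary generated from the Google Books Ngram Corpus.
ABBREVIATIONS = {"г.", "см.", "рис.", "др.", "им."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, treating a trailing period as a sentence
    boundary only when the token is not a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_with_terminator = token.endswith((".", "!", "?"))
        is_abbreviation = token.lower() in abbreviations
        if ends_with_terminator and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that lacks a final terminator.
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Он родился в 1799 г. в Москве. См. рис. 2 ниже."))
# ['Он родился в 1799 г. в Москве.', 'См. рис. 2 ниже.']
```

This sketch never splits after a dictionary entry, so it misses the case where an abbreviation is also the last word of a sentence. Resolving that ambiguity needs additional context, such as the capitalization of the following word.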

Taxonomy

Topics: Lexicography and Language Studies · Natural Language Processing Techniques · Language and cultural evolution