# Evaluation of basic modules for isolated spelling error correction in Polish texts

**Authors:** Szymon Rutkowski

arXiv: 1905.10810 · 2019-05-28

## TL;DR

This paper evaluates basic modules for correcting isolated spelling errors in Polish texts on PlEWi, a corpus of annotated spelling errors extracted from Polish Wikipedia. It identifies two promising approaches: combining edit distance with cosine distance of semantic vectors, and LSTM models enhanced with ELMo embeddings.

## Contribution

It provides a comparative evaluation of different spelling correction modules on Polish data, highlighting effective methods and potential combinations for improved performance.

## Key findings

- Combining edit distance with cosine distance between semantic vectors is suggested for systems that need to be interpretable.
- An LSTM, particularly when enhanced with ELMo embeddings, appears to offer the best raw correction performance.
- The evaluated modules may be further combined with solutions for error detection and context awareness.
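
The first finding can be illustrated with a minimal sketch of candidate ranking that mixes a normalized edit distance with a cosine distance between vectors. This is not the paper's implementation: the function names, the `weight` parameter, the equal weighting, and the toy two-dimensional vectors are all assumptions made for illustration, and the sketch assumes a vector is available for the misspelled token (for example from a subword-aware embedding model).

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cosine_distance(u, v) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0:
        return 1.0
    return 1.0 - dot / norm


def rank_candidates(error, error_vec, candidates, weight=0.5):
    """Rank candidate corrections by a weighted sum of both distances.

    `candidates` maps each candidate word to its vector. `weight` is a
    hypothetical mixing parameter, not a value taken from the paper.
    """
    scored = []
    for word, vec in candidates.items():
        # Normalize edit distance by the longer word so it lies in [0, 1].
        edit = levenshtein(error, word) / max(len(error), len(word), 1)
        sem = cosine_distance(error_vec, vec)
        scored.append((weight * edit + (1 - weight) * sem, word))
    return [word for _, word in sorted(scored)]


# Toy example: "ktury" as a misspelling of "który"; vectors are made up.
toy_candidates = {
    "który": [0.9, 0.1],
    "kury": [0.0, 1.0],
    "kultury": [0.5, 0.5],
}
ranking = rank_candidates("ktury", [1.0, 0.0], toy_candidates)
print(ranking)
```

Because each term of the score is a separate, human-readable distance, it is easy to inspect why one candidate outranks another, which is the sense in which such a combination is interpretable.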

## Abstract

Spelling error correction is an important problem in natural language processing, as a prerequisite for good performance in downstream tasks as well as an important feature in user-facing applications. For texts in Polish, there exist works on specific error correction solutions, often developed for dealing with specialized corpora, but not evaluations of many different approaches on big resources of errors. We begin to address this problem by testing some basic and promising methods on PlEWi, a corpus of annotated spelling errors extracted from Polish Wikipedia. These modules may be further combined with appropriate solutions for error detection and context awareness. Following our results, combining edit distance with cosine distance of semantic vectors may be suggested for interpretable systems, while an LSTM, particularly enhanced by ELMo embeddings, seems to offer the best raw performance.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.10810/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/1905.10810/full.md

---
Source: https://tomesphere.com/paper/1905.10810