Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis
Paul-Andrei Pogăcean, Sanda-Maria Avram

TL;DR
This paper presents a non-AI approach to language detection that compares character bigram frequency rankings using Minkowski norms. It reports over 80% accuracy on texts shorter than 150 characters and 100% accuracy on longer texts, across texts of varying length and genre.
Contribution
The paper introduces a frequency-based language detection algorithm that applies Minkowski norms to character bigram rankings and evaluates it on texts of varying length, historical period, and genre.
Findings
Over 80% accuracy on texts shorter than 150 characters
100% accuracy on longer texts
Classical frequency-based methods are effective and scalable
Abstract
The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. However, non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determination that leverages monogram and bigram frequency rankings derived from established linguistic research. The datasets comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy on longer texts. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
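The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each language is represented by a table of relative bigram frequencies, and that a text is assigned to the language whose table lies closest under the Minkowski distance of order p, that is, (sum over bigrams of |a_i - b_i|^p)^(1/p). The function names, the choice of p = 2, the use of raw frequencies instead of ranks, the omission of monograms, and the toy profiles built from single sentences are all illustrative assumptions; the paper relies on frequency rankings from established linguistic research.

```python
from collections import Counter


def bigram_frequencies(text):
    """Relative frequencies of adjacent letter pairs inside each word."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation and digits do not form bigrams.
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def minkowski_distance(freq_a, freq_b, p=2.0):
    """Minkowski distance of order p between two sparse frequency tables."""
    keys = set(freq_a) | set(freq_b)
    total = sum(abs(freq_a.get(k, 0.0) - freq_b.get(k, 0.0)) ** p for k in keys)
    return total ** (1.0 / p)


def detect_language(text, profiles, p=2.0):
    """Return the label of the profile closest to the text's bigram table."""
    sample = bigram_frequencies(text)
    return min(profiles, key=lambda lang: minkowski_distance(sample, profiles[lang], p))


# Hypothetical toy profiles built from one sentence each (illustration only;
# real profiles would come from large corpora or published frequency tables).
TOY_PROFILES = {
    "en": bigram_frequencies(
        "once upon a time there was a king who had three sons and they lived in the castle"
    ),
    "ro": bigram_frequencies(
        "a fost odată ca niciodată un împărat care avea trei feciori"
    ),
}

if __name__ == "__main__":
    print(detect_language("there was a king in the castle", TOY_PROFILES))
```

With p = 1 the distance reduces to the Manhattan distance and with p = 2 to the Euclidean distance, so the order p is a tunable parameter of such a classifier. Very short inputs yield few bigrams, which is consistent with the lower accuracy the abstract reports for texts under 150 characters.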
Taxonomy
Topics: Authorship Attribution and Profiling · Text Readability and Simplification · Topic Modeling
