Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis

Paul-Andrei Pogăcean; Sanda-Maria Avram

arXiv:2507.16284·cs.CL·July 24, 2025

TL;DR

This paper presents a mathematical approach to language detection based on Minkowski norms and character bigram frequencies. It reports over 80% accuracy on texts shorter than 150 characters and 100% accuracy on longer texts, across several genres and historical periods, and argues that classical methods remain effective.

Contribution

It introduces a novel frequency-based language detection algorithm leveraging Minkowski norms and character bigram rankings, demonstrating its effectiveness across diverse datasets.
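
For reference, the standard definition of the Minkowski distance of order p between two frequency vectors is given below. The symbols x, y, n, and p are generic notation chosen here for illustration and are not taken from the paper; which order p the authors use is not stated on this page.

```latex
d_p(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}, \qquad p \ge 1
```

With p = 1 this is the Manhattan distance, and with p = 2 it is the Euclidean distance.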

Findings

01 Over 80% accuracy on texts shorter than 150 characters

02 100% accuracy on longer texts

03 Classical frequency-based methods are effective and scalable

Abstract

The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. However, non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determination that leverages monogram and bigram frequency rankings derived from established linguistic research. The datasets comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy for longer texts. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
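
The general idea of comparing a text's bigram frequencies against per-language reference profiles with a Minkowski distance might be sketched as follows. This is a minimal illustration, not the authors' algorithm: the function names, the toy sample sentences in `SAMPLES`, and the choice of relative frequencies (rather than the frequency rankings the abstract mentions) are all assumptions made for the example.

```python
from collections import Counter


def bigram_profile(text):
    """Relative frequency of each within-word letter bigram in `text`."""
    # Keep letters only; everything else becomes a word separator.
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    words = cleaned.split()
    counts = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


def minkowski_distance(a, b, p=2.0):
    """Minkowski distance of order p between two sparse frequency profiles."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) ** p for k in keys) ** (1.0 / p)


def detect_language(text, references, p=2.0):
    """Return the language whose reference profile is closest to `text`."""
    profile = bigram_profile(text)
    return min(references, key=lambda lang: minkowski_distance(profile, references[lang], p))


# Hypothetical toy reference data; a real system would use frequency
# tables built from large corpora or published linguistic research.
SAMPLES = {
    "english": "the cat and the dog are not in the house with the other things",
    "german": "der hund und die katze sind nicht in dem haus mit den anderen sachen",
}
REFERENCES = {lang: bigram_profile(sample) for lang, sample in SAMPLES.items()}

print(detect_language("the house is there", REFERENCES))
```

Profiles are stored as sparse dictionaries, so a bigram missing from one side counts as frequency zero. The order `p` is left as a parameter because the page does not say which value the paper uses.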

Taxonomy

Topics: Authorship Attribution and Profiling · Text Readability and Simplification · Topic Modeling