Open-Set Language Identification

Shervin Malmasi

arXiv:1707.04817 · cs.CL · July 18, 2017

TL;DR

This paper introduces an open-set language identification method that pairs one-class classification with hashing-based feature vectors. Trained on monolingual corpora only, it achieves an average F-score of 0.99 across 10 languages from different writing systems.

Contribution

The paper proposes a hashing-based feature vectorization approach and demonstrates that, combined with one-class classifiers each trained on a single language, it is effective for open-set language identification.

Findings

01 Achieved an average F-score of 0.99 across 10 languages.

02 Identified shortcomings of traditional feature extraction methods.

03 Validated the approach on diverse writing systems.
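For readers unfamiliar with the metric, here is a minimal sketch of how a per-language F-score and its average might be computed. It rests on assumptions: the paper summary does not say how the average is formed, so the sketch assumes an unweighted mean of per-language F1 on the target class. The function names and the toy labels are hypothetical and are not the paper's data.

```python
def f1_score(gold, predicted):
    """F1 for the target class; gold/predicted are lists of booleans
    (True = sentence belongs to the model's language)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(results):
    """Unweighted mean of per-language F1 (an assumed averaging scheme).
    results maps a language code to a (gold, predicted) pair."""
    scores = [f1_score(gold, pred) for gold, pred in results.values()]
    return sum(scores) / len(scores)


# Toy, made-up labels for two hypothetical one-class models.
toy_results = {
    "en": ([True, True, False, False], [True, True, False, False]),
    "ja": ([True, True, False, False], [True, False, False, False]),
}
toy_average = average_f1(toy_results)
```

In this toy case the first model is perfect (F1 of 1) and the second misses one in-language sentence (precision 1, recall 0.5), so the mean falls between the two.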

Abstract

We present the first open-set language identification experiments using one-class classification. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine using only a monolingual corpus for each language. Each model is evaluated against a test set of data from all 10 languages and we achieve an average F-score of 0.99, highlighting the effectiveness of this approach for open-set language identification.
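To make the idea of hashing-based feature vectorization concrete, the following is a minimal sketch of the general hashing trick applied to character n-grams. It is not the paper's implementation: the n-gram range, the vector size, the use of CRC32 as the hash function, and the function names are all assumptions chosen for illustration. The point it shows is that a hashed vector needs no vocabulary learned from training data, so text in a script never seen during training still maps to features.

```python
import zlib


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def hashed_vector(text, n_features=2 ** 18, n_min=1, n_max=3):
    """Map text to a sparse count vector of fixed dimensionality.

    Returns a dict {bucket index: count}. n_features and the n-gram
    range are illustrative defaults, not values from the paper.
    """
    vec = {}
    for gram in char_ngrams(text, n_min, n_max):
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(gram.encode("utf-8")) % n_features
        vec[index] = vec.get(index, 0) + 1
    return vec


latin = hashed_vector("abc")
kanji = hashed_vector("日本語")  # no vocabulary needed for an unseen script
```

Vectors of this kind could then be passed to a one-class model, for example scikit-learn's `OneClassSVM`, fitted on sentences from a single language. Distinct n-grams can collide in the same bucket; a larger `n_features` makes that less likely at the cost of a larger feature space.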

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Text and Document Classification Technologies