A reproduction of Apple's bi-directional LSTM models for language identification in short strings

Mads Toftrup; Søren Asger Sørensen; Manuel R. Ciosici; Ira Assent

arXiv:2102.06282·cs.CL·February 15, 2021

TL;DR

This paper reproduces Apple's bi-directional LSTM model for language identification in short text snippets, confirming its superior performance over open-source alternatives and analyzing its common errors.

Contribution

It provides a detailed reproduction and validation of Apple's language identification model, highlighting its effectiveness and common confusion errors.

Findings

1. Bi-LSTM model outperforms open-source language identifiers.
2. Model's errors mainly involve confusion between related languages.
3. Reproduction confirms original performance claims.

Abstract

Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.
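The error analysis mentioned in the abstract can be illustrated with a minimal sketch that tallies which language pairs a classifier mixes up. The function name `confusion_pairs` and the toy labels are assumptions made for illustration only; they are not taken from the paper or its repository, and the numbers are not results.

```python
from collections import Counter


def confusion_pairs(gold, predicted):
    """Count (gold, predicted) label pairs for misclassified items only."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


# Hypothetical toy labels (ISO 639-1 codes), NOT data or results from the paper.
gold = ["da", "da", "nb", "es", "pt", "en"]
predicted = ["nb", "da", "da", "es", "es", "en"]

pairs = confusion_pairs(gold, predicted)

# List the confused pairs, most frequent first.
for (g, p), n in pairs.most_common():
    print(f"{g} -> {p}: {n}")
```

Sorting the misclassified pairs by frequency makes it easy to see whether the errors cluster around closely related languages, which is the pattern the abstract reports.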

Code & Models

Repositories

AU-DIS/LSTM_langid
PyTorch · Official

Taxonomy

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM