TL;DR
This paper reproduces the bi-directional LSTM language identifier that Apple sketched in a blog post, confirms that it outperforms current open-source language identifiers, and finds that its mistakes come from confusion between related languages.
Contribution
The paper reproduces and validates Apple's language identification model for very short strings, and analyzes which languages the model confuses with one another.
Findings
The bi-LSTM model outperforms current open-source language identifiers.
The model's mistakes are mainly confusions between related languages.
The reproduction confirms the bi-LSTM model's performance.
Abstract
Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.
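To make the architecture named in the abstract concrete, here is a minimal sketch of a character-level bi-directional LSTM classifier, written in plain Python so that it runs without any deep learning library. It is not the authors' or Apple's implementation: the label set, the hidden and embedding sizes, the random character embedding, and the choice to classify from the concatenated final states of the two directions are all assumptions made for illustration. The weights are random and untrained, so the printed probabilities carry no meaning beyond showing the data flow.

```python
import math
import random

LANGS = ["en", "de", "fr"]  # hypothetical label set, not from the paper
EMBED = 6                   # hypothetical character-embedding size
HIDDEN = 8                  # hypothetical hidden size per direction

rng = random.Random(0)      # fixed seed so the sketch is repeatable


def rand_matrix(rows, cols):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def embed(ch):
    """Deterministic pseudo-random vector per character (stand-in for a learned embedding)."""
    r = random.Random(ord(ch))
    return [r.uniform(-1.0, 1.0) for _ in range(EMBED)]


class LSTMCell:
    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        # One weight matrix and bias per gate, applied to the concatenation [x; h_prev].
        self.w = {g: rand_matrix(hidden_size, input_size + hidden_size) for g in self.GATES}
        self.b = {g: [0.0] * hidden_size for g in self.GATES}

    def step(self, x, h, c):
        z = x + h  # list concatenation of input and previous hidden state
        pre = {g: [a + b for a, b in zip(matvec(self.w[g], z), self.b[g])] for g in self.GATES}
        i = [sigmoid(v) for v in pre["input"]]
        f = [sigmoid(v) for v in pre["forget"]]
        o = [sigmoid(v) for v in pre["output"]]
        cand = [math.tanh(v) for v in pre["candidate"]]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, cand)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def run(self, xs):
        """Process a sequence and return the final hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in xs:
            h, c = self.step(x, h, c)
        return h


class BiLSTMClassifier:
    def __init__(self):
        self.fwd = LSTMCell(EMBED, HIDDEN)  # reads the string left to right
        self.bwd = LSTMCell(EMBED, HIDDEN)  # reads the string right to left
        self.out = rand_matrix(len(LANGS), 2 * HIDDEN)

    def predict_proba(self, text):
        xs = [embed(ch) for ch in text]
        # Concatenate the final states of both directions, then apply a softmax layer.
        h = self.fwd.run(xs) + self.bwd.run(xs[::-1])
        logits = matvec(self.out, h)
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        return {lang: e / total for lang, e in zip(LANGS, exps)}


if __name__ == "__main__":
    print(BiLSTMClassifier().predict_proba("guten tag"))
```

Reading the string in both directions lets every character's representation depend on context from both sides, which is one plausible reason to prefer a bi-directional model when the input is only a few words long.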
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM
