Hierarchical Character-Word Models for Language Identification

Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A. Smith

arXiv:1608.03030·cs.CL·August 11, 2016

TL;DR

This paper presents a hierarchical model that combines character-level and word-level representations to identify the language of social media messages, which are hard to classify because they are short and often use unconventional spelling.

Contribution

The paper introduces a novel hierarchical character-word model that enhances language identification accuracy and can detect code-switching in social media text.

Findings

01 Outperforms strong baseline models

02 Effective in identifying language in brief, informal texts

03 Capable of revealing code-switching instances

Abstract

Social media messages' brevity and unconventional spelling pose a challenge to language identification. We introduce a hierarchical model that learns character and contextualized word-level representations for language identification. Our method performs well against strong baselines, and can also reveal code-switching.
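
The hierarchy the abstract describes (characters feed word representations, which are contextualized before a language decision is made) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the mean-pooling, the neighbour-averaging stand-in for a contextual encoder, the per-word labels, and every name here (`LANGS`, `DIM`, `word_vec`, `contextualize`, `identify`) are assumptions made for illustration, and all weights are random and untrained.

```python
import math
import random

LANGS = ["en", "es"]  # hypothetical label set, for illustration only
DIM = 8               # hypothetical embedding size

rng = random.Random(0)
_char_table = {}


def char_vec(ch):
    """Look up (or lazily create) a random, untrained character embedding."""
    if ch not in _char_table:
        _char_table[ch] = [rng.uniform(-1, 1) for _ in range(DIM)]
    return _char_table[ch]


def word_vec(word):
    """Character level: mean-pool character embeddings into one word vector."""
    vecs = [char_vec(ch) for ch in word]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def contextualize(word_vecs):
    """Word level: average each word vector with its immediate neighbours.

    This is only a stand-in for a learned contextual encoder.
    """
    out = []
    for i in range(len(word_vecs)):
        window = word_vecs[max(0, i - 1): i + 2]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


# Random, untrained output layer: one weight vector per language.
W_OUT = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in LANGS]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def identify(message):
    """Return (message label, message distribution, per-word labels)."""
    words = message.split()
    if not words:
        raise ValueError("message contains no words")
    ctx = contextualize([word_vec(w) for w in words])
    # One language distribution per word.
    per_word = [
        softmax([sum(w * x for w, x in zip(row, v)) for row in W_OUT])
        for v in ctx
    ]
    # Message-level distribution: average of the per-word distributions.
    dist = [sum(col) / len(per_word) for col in zip(*per_word)]
    best = max(range(len(LANGS)), key=dist.__getitem__)
    # Per-word labels are what would expose code-switching in a trained model.
    tagged = [
        (w, LANGS[max(range(len(LANGS)), key=p.__getitem__)])
        for w, p in zip(words, per_word)
    ]
    return LANGS[best], dist, tagged
```

Calling `identify("hola amigo how are you")` returns a message-level label, a distribution over `LANGS`, and one label per word. Because the weights are untrained, the labels themselves carry no meaning; the sketch only shows how a per-word view of the message could surface a switch between languages.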

Code & Models

Repositories

ajaech/twitter_langid (TensorFlow)