On the Strength of Character Language Models for Multilingual Named Entity Recognition

Xiaodong Yu; Stephen Mayhew; Mark Sammons; Dan Roth

arXiv:1809.05157 · cs.CL · September 21, 2018

TL;DR

This paper investigates how well character-level language models (CLMs) distinguish named entity tokens from non-entity tokens across multiple languages. It shows that CLMs perform this binary task at close to the level of full NER systems, and that simple CLM-based features improve an existing NER system.

Contribution

It demonstrates that corpus-agnostic character-level language models can effectively identify named entities and improve multilingual NER performance.

Findings

01. CLMs accurately distinguish name tokens across languages

02. Adding CLM-based features improves NER system performance

03. CLMs perform close to full NER systems in identifying named entities

Abstract

Character-level patterns have been widely used as features in English Named Entity Recognition (NER) systems. However, to date there has been no direct investigation of the inherent differences between name and non-name tokens in text, nor whether this property holds across multiple languages. This paper analyzes the capabilities of corpus-agnostic Character-level Language Models (CLMs) in the binary task of distinguishing name tokens from non-name tokens. We demonstrate that CLMs provide a simple and powerful model for capturing these differences, identifying named entity tokens in a diverse set of languages at close to the performance of full NER systems. Moreover, by adding very simple CLM-based features we can significantly improve the performance of an off-the-shelf NER system for multiple languages.
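
To make the binary task in the abstract concrete, the following is a minimal sketch of one way a character-level language model could separate name tokens from non-name tokens: train one model on names, one on non-names, and assign a token to whichever model gives it the higher average per-character log-probability. Everything here is an illustrative assumption rather than the paper's setup: the add-one-smoothed bigram model, the `^` and `$` boundary markers, the helper names `CharBigramLM`, `train_pair`, and `is_name`, and the tiny hand-written word lists.

```python
import math
from collections import Counter

BOS, EOS = "^", "$"  # hypothetical token-boundary markers


class CharBigramLM:
    """Toy character bigram model with add-one smoothing (a stand-in for a CLM)."""

    def __init__(self, tokens, vocab_size):
        self.vocab_size = vocab_size
        self.bigrams = Counter()   # counts of (previous char, current char)
        self.contexts = Counter()  # counts of previous char
        for token in tokens:
            chars = BOS + token + EOS
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def avg_logprob(self, token):
        """Average smoothed log-probability per character transition."""
        chars = BOS + token + EOS
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + self.vocab_size
            total += math.log(num / den)
        return total / (len(chars) - 1)


def train_pair(names, non_names):
    """Train a name model and a non-name model over a shared character vocabulary."""
    vocab_size = len(set("".join(names + non_names))) + 1  # +1 for EOS
    return CharBigramLM(names, vocab_size), CharBigramLM(non_names, vocab_size)


def is_name(token, name_lm, other_lm):
    """Label a token as a name if the name model scores it higher."""
    return name_lm.avg_logprob(token) > other_lm.avg_logprob(token)


# Illustrative toy data only; a real experiment would use annotated NER corpora.
NAMES = ["London", "Paris", "Obama", "Merkel", "Google", "Tokyo"]
NON_NAMES = ["the", "running", "quickly", "of", "and", "house", "they", "table"]

name_lm, other_lm = train_pair(NAMES, NON_NAMES)
print(is_name("London", name_lm, other_lm))
print(is_name("the", name_lm, other_lm))
```

Case is deliberately preserved in this sketch, since capitalization is one of the character-level signals such a model can pick up in languages that use it. A more realistic variant would use longer n-grams or a neural model and would handle characters unseen in training explicitly.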

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification