
TL;DR
This paper studies language identification for single names, either in isolation or embedded in a document written in a different language. It introduces a new corpus that matches names against languages and uses it to compare general language models with names-only language models.
Contribution
It presents a new corpus for name-language matching and compares the performance of different language models on name and short fragment identification tasks.
Findings
Names-only models perform comparably to general models on name identification.
Performance varies between isolated names and short document fragments.
The new corpus enables more accurate evaluation of language identification methods.
Abstract
This paper describes experiments on identifying the language of a single name in isolation or in a document written in a different language. A new corpus has been compiled and made available, matching names against languages. This corpus is used in a series of experiments measuring the performance of general language models and names-only language models on the language identification task. Conclusions are drawn from the comparison between using general language models and names-only language models and between identifying the language of isolated names and the language of very short document fragments. Future research directions are outlined.
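To make the task concrete, here is a minimal sketch of language identification for a single name. It assumes character-bigram language models with add-one smoothing, which is a common baseline and not necessarily the model family used in the paper. The tiny name lists, the language codes, and the helper names (`char_ngrams`, `CharNgramModel`, `identify`) are all hypothetical and chosen only for illustration. They stand in for a "names-only" training set; a "general" model would be trained the same way on running text instead of names.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the character n-grams of text, padded with start/end markers."""
    padded = "^" * (n - 1) + text.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Character n-gram model for one language (illustrative, not the paper's)."""

    def __init__(self, samples, n=2):
        self.n = n
        self.counts = Counter()          # n-gram counts
        self.context_counts = Counter()  # counts of the (n-1)-character context
        self.vocab = set()               # characters seen as the predicted symbol
        for sample in samples:
            for gram in char_ngrams(sample, n):
                self.counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
                self.vocab.add(gram[-1])

    def log_prob(self, text, vocab_size):
        """Add-one smoothed log-probability of text under this model."""
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = self.counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total


def identify(text, models):
    """Return the language whose model assigns text the highest log-probability."""
    # Shared vocabulary size (plus one slot for unseen characters) keeps the
    # smoothing comparable across languages.
    vocab_size = len(set().union(*(m.vocab for m in models.values()))) + 1
    return max(models, key=lambda lang: models[lang].log_prob(text, vocab_size))


# Hypothetical toy "names-only" training data; a real corpus would be far larger.
models = {
    "it": CharNgramModel(["giovanni", "francesco", "alessandro", "giuseppe", "marco"]),
    "fi": CharNgramModel(["mikko", "jukka", "pekka", "heikki", "antti"]),
}

print(identify("giovanni", models))
print(identify("pekka", models))
```

With lists this small the sketch only separates names that resemble its training data. Comparing names-only and general models, as the paper does, would mean training two such models per language on different data and scoring both on the same held-out names.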
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
