Global Entity Ranking Across Multiple Languages
Prantik Bhattacharyya, Nemanja Spasojevic

TL;DR
This paper develops a multilingual entity ranking system leveraging Wikipedia and Freebase, achieving high precision and F1 scores across 27 million entities, facilitating future research in cross-lingual knowledge organization.
Contribution
It introduces a novel model for global entity ranking across multiple languages using large-scale knowledge bases and a ground-truth dataset, with comprehensive performance evaluation.
Findings
Ranks 27 million entities with 75% precision
Achieves 48% F1 score in multilingual ranking
Provides open access to ranked entity lists
Abstract
We present work on building a global long-tailed ranking of entities across multiple languages using Wikipedia and Freebase knowledge bases. We identify multiple features and build a model to rank entities using a ground-truth dataset of more than 10 thousand labels. The final system ranks 27 million entities with 75% precision and 48% F1 score. We provide performance evaluation and empirical evidence of the quality of ranking across languages, and open the final ranked lists for future research.
Table 1: Precision (P), recall (R), F1, population coverage (C), and RMSE for each feature and for the final system (All Feat.).
| Source | Feature | P | R | F1 | C | RMSE |
|---|---|---|---|---|---|---|
| Wikipedia | Page Rank | 0.59 | 0.05 | 0.09 | 0.164 | 1.54 |
| | Outlink Count | 0.55 | 0.13 | 0.21 | 0.164 | 2.09 |
| | Inlink Count | 0.62 | 0.12 | 0.20 | 0.164 | 1.82 |
| | In Out Ratio | 0.75 | 0.19 | 0.31 | 0.164 | 1.54 |
| | Category Count | 0.65 | 0.21 | 0.36 | 0.164 | 1.89 |
| Freebase | Subject # | 0.28 | 0.06 | 0.10 | 1.000 | 2.39 |
| | Subject Types # | 0.42 | 0.10 | 0.16 | 1.000 | 2.25 |
| | Object # | 0.62 | 0.12 | 0.20 | 0.973 | 2.00 |
| | Object Types # | 0.46 | 0.11 | 0.17 | 0.973 | 2.25 |
| | Klout Score | 0.57 | 0.11 | 0.17 | 0.004 | 2.32 |
| - | All Feat. | 0.75 | 0.37 | 0.48 | 1.00 | 1.15 |
Table 2: Example entities and their ranks in each language.
| Entity | EN | AR | ES | FR | IT |
|---|---|---|---|---|---|
| Vogue | 2 | 6,173 | 200 | 2,341 | 62 |
| World Bank | 322 | 103 | 3,747 | 2,758 | 5,704 |
| Morocco | 1,277 | 2 | 527 | 544 | 232 |
| Donald Duck | 10,001 | 9,494 | 7,444 | 10,380 | 4,575 |
| Balkans | 36,753 | 109 | 17,456 | 9,383 | 2,854 |
| Bed | 109,686 | 23,809 | 68,180 | 66,859 | 52,713 |
| Bunk Bed | 992,576 | 64,399 | 330,669 | 906,988 | 416,292 |
©2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License.
WWW 2017 Companion, April 3-7, 2017, Perth, Australia.
Global Entity Ranking Across Multiple Languages
Prantik Bhattacharyya
Nemanja Spasojevic
Lithium Technologies | Klout
San Francisco, CA
{prantik.bhattacharyya, nemanja.spasojevic}@lithium.com
Abstract
We present work on building a global long-tailed ranking of entities across multiple languages using Wikipedia and Freebase knowledge bases. We identify multiple features and build a model to rank entities using a ground-truth dataset of more than 10 thousand labels. The final system ranks 27 million entities with 75% precision and 48% F1 score. We provide performance evaluation and empirical evidence of the quality of ranking across languages, and open the final ranked lists for future research.
keywords:
entity ranking; entity extraction; knowledge base;
1 Introduction
In the past decade, a number of openly available Knowledge Bases (KBs) have emerged. The most popular ones include Freebase, Wikipedia, and Yago, containing around 48M, 25M, and 10M entities respectively. Many of the entities overlap across the KBs. In NLP entity linking (also known as named entity linking (NEL), named entity disambiguation (NED), or named entity recognition and disambiguation (NERD)), the task is to link entities mentioned in text to their identity within the KB. A foundational part of setting up a real-time entity linking system is choosing which entities to consider, as memory constraints prohibit considering the entire knowledge base [1]. Additionally, some entities may not be relevant. To maximize the quality of the NLP entity linking system, we need to include as many important entities as possible.
In this paper we identify a collection of features to perform scoring and ranking of the entities. We also introduce the ground truth data set that we use to train and apply the ranking function.
2 Related Work
A large body of previous work has addressed ranking entities in terms of temporal popularity, as well as in the context of a query; however, little work has been done on building a global rank of entities within a KB. Temporal entity importance on Twitter was studied by Saleiro and Soares [4]. In [2], the authors propose a hybrid model of entity ranking and selection in the context of displaying the most important entities for a given constraint while eliminating redundant entities. Ranking of Wikipedia entities in the context of a query has been done using link structure and categories [5], as well as graph methods and web search [6].
3 Our Approach
Given a KB, we want to build a global long-tailed ranking of entities in order of socially recognizable importance. When building the NLP entity linking system, the top-ranked entities from the KB should yield maximum perceived quality for casual observers.
3.1 Data Set
We collected a labeled data set by selecting entities. We randomly sampled entities and also added some important ones, to balance the skewed ratio of important to non-important entries in KBs. Each evaluator scored the entities on a scale of 1 to 5, with 5 being most important. Seven evaluators used the following guidelines regarding importance:
Public Persons: important if currently major pro athletes, serving politicians, etc. If no longer active, important if influential (e.g. Muhammad Ali, Tony Blair).
Locations: look at population (e.g. Albany, California vs. Toronto, Canada), historical significance (Waterloo).
Dates: unimportant unless shorthand for a holiday or event (4th of July, 9/11).
Newspapers: important, especially high-circulation ones (WSJ).
Sports Teams: important if in pro league.
Schools: important if recognised globally.
Films & Song: major franchises and influential classics are important – more obscure are often not.
Laws: important if they enacted social change (Loving v. Virginia, Roe v. Wade), unimportant otherwise.
Disambiguators: entities that disambiguate are important because we want them in the dictionary (Apple, Inc. and Apple Fruit).
3.2 Features and Scoring
Features were derived from Freebase and Wikipedia sources. They capture popularity within Wikipedia links and how important an entity is within Freebase. The Wikipedia signals include page rank, inlink and outlink counts and their ratio, and the number of categories a page belongs to. From Freebase, we use the number of objects a given entity is connected to (the object and object type count features), as well as the number of times a given entity was an object with respect to another entity (the subject and subject type count features). We also extract social media identities mentioned in an entity’s KB and use their Klout score [3] as a feature. The full set of derived features, along with their performance, is listed in Table 1.
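As an illustration of the link-based signals described above, the following minimal sketch computes inlink count, outlink count, and an in/out ratio from a list of page-to-page links. The input format, the function name `link_features`, and the exact ratio definition (inlinks divided by outlinks, with a fallback when a page has no outlinks) are assumptions made for this example; the paper does not specify them.

```python
from collections import defaultdict


def link_features(edges):
    """Compute per-page link features from (source_page, target_page) pairs.

    The edge-list input format is hypothetical, chosen only for illustration.
    """
    inlinks = defaultdict(int)
    outlinks = defaultdict(int)
    for src, dst in edges:
        outlinks[src] += 1
        inlinks[dst] += 1

    features = {}
    for page in set(inlinks) | set(outlinks):
        n_in, n_out = inlinks[page], outlinks[page]
        # Assumption: ratio = inlinks / outlinks; if a page has no outlinks,
        # fall back to the raw inlink count to avoid division by zero.
        ratio = n_in / n_out if n_out else float(n_in)
        features[page] = {
            "inlink_count": n_in,
            "outlink_count": n_out,
            "in_out_ratio": ratio,
        }
    return features


# Toy link graph: A -> B, A -> C, B -> C, C -> A
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
feats = link_features(edges)
print(feats["C"])  # page C has two inlinks and one outlink
```

A page that many others link to, but which links out rarely, gets a high ratio under this definition, which is one plausible reading of why such a feature could signal importance.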
We model the evaluator’s score using simple linear regression. The feature vector for an entity $e$ is represented as $F(e) = [f_1(e), f_2(e), \dots, f_m(e)]$, where $f_k(e)$ is the feature value associated with a specific feature $k$. Normalized feature values are denoted by $\hat{f}_k(e)$, and the normalized feature vector by $\hat{F}(e)$; each feature is normalized before scoring. The importance score for an entity $e$ is denoted by $S(e)$ and is computed as the dot product of a weight vector $W$ and the normalized feature vector:

$$S(e) = W \cdot \hat{F}(e) \qquad (1)$$

The weight vector $W$ is computed with supervised learning techniques, using labeled ground truth data (train/test split of 80/20).
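The scoring step can be sketched as follows: normalize the features, fit a weight vector by least squares on 80% of the labeled data, and score the remaining 20% with a dot product. This is a sketch under stated assumptions, not the authors' implementation: the max-scaling normalization, the synthetic data, and the helper names (`normalize`, `fit_weights`, `score`) are all hypothetical, and the fit has no intercept because the score is described as a plain dot product.

```python
import numpy as np


def normalize(features):
    # Assumption: divide each feature column by its maximum value.
    # The paper's own normalization formula is not reproduced here.
    col_max = features.max(axis=0)
    col_max[col_max == 0] = 1.0  # avoid division by zero for empty features
    return features / col_max


def fit_weights(norm_features, labels):
    # Ordinary least squares: weights minimizing ||norm_features @ w - labels||^2.
    weights, *_ = np.linalg.lstsq(norm_features, labels, rcond=None)
    return weights


def score(norm_features, weights):
    # Importance score = dot product of weights and normalized features.
    return norm_features @ weights


# Synthetic data, for illustration only (not from the paper).
rng = np.random.default_rng(0)
raw = rng.integers(1, 1000, size=(100, 3)).astype(float)
X = normalize(raw)
true_w = np.array([1.0, 2.0, 1.5])
labels = X @ true_w  # noiseless synthetic labels

split = int(0.8 * len(X))  # 80/20 train/test split
w = fit_weights(X[:split], labels[:split])
pred = score(X[split:], w)
print(w)
```

Because the synthetic labels are an exact linear function of the features, the fitted weights recover `true_w` up to floating-point error; with real evaluator labels the fit would only be approximate.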
4 Experiments
Table 1 shows precision, recall, F1, and population coverage for the full list of features and the final system. The importance score was calculated using Eq. 1, and the final score was rounded to an integer value so that it could be compared against the labels from the ground-truth data.
We observe that Wikipedia features have the highest precision among all the features. The Freebase features have the highest coverage values. The Klout score feature also has one of the highest individual precision values. While this feature has the lowest coverage, it helps boost the final score and floats up a few relevant entities for final system application in social media platforms. We also look at root mean squared error (RMSE) of the entity scores against assigned labels. The final model shows the lowest RMSE value.
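To make the evaluation quantities concrete, here is a small sketch of rounding scores to the 1 to 5 label scale, computing RMSE against labels, and computing a feature's population coverage. Several details are assumptions not stated in the paper: the clamping of rounded scores to the label range, computing RMSE on the rounded scores, and treating coverage as the fraction of entities for which a feature value is present. The scores and labels are made up.

```python
import math


def round_to_label(score, low=1, high=5):
    # Round half up to the nearest integer, then clamp to the label scale.
    # Clamping is an assumption made for this sketch.
    return min(high, max(low, math.floor(score + 0.5)))


def rmse(predicted, actual):
    # Root mean squared error between predictions and labels.
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def coverage(feature_values):
    # Fraction of entities that have a value for the feature (None = missing).
    return sum(v is not None for v in feature_values) / len(feature_values)


# Made-up scores and labels, for illustration only.
scores = [1.2, 2.6, 4.9, 3.5, 0.4]
labels = [1, 3, 5, 3, 2]
rounded = [round_to_label(s) for s in scores]

print(rounded)
print(rmse(rounded, labels))
print(coverage([0.7, None, None, 0.2]))
```

A feature such as the Klout score, which is present for only a small fraction of entities, would have low coverage under this definition even if it is precise where available.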
We also plot the distribution of entity types in the top million ranked entities and in the unranked list for the English language. The share of entities of type ‘person’ differs between the global list and the top-ranked list, and the percentage of ‘MISC’ entity types drops in the top-ranked list. These differences in coverage suggest that entities are ranked relevantly in the corpus.
In Table 2, we provide examples of entities with their ranks in a particular language. We see that entity ranks are regionally sensitive in the context of their language, e.g. ‘Morocco’ is ranked 2 in the ranking for the Arabic language. We also observe that the rankings are sensitive to the specificity of the entity; for example, ‘bunk bed’ is ranked far lower than the more generic entity ‘bed’.
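The two observations can be checked directly against the Table 2 values. The sketch below copies three rows from Table 2 and uses two hypothetical helpers, `best_language` and `rank_ratio`, introduced only for this illustration.

```python
LANGS = ["EN", "AR", "ES", "FR", "IT"]

# Ranks copied from Table 2 (lower rank = more important).
RANKS = {
    "Morocco": [1277, 2, 527, 544, 232],
    "Bed": [109686, 23809, 68180, 66859, 52713],
    "Bunk Bed": [992576, 64399, 330669, 906988, 416292],
}


def best_language(entity):
    # Language in which the entity has its best (smallest) rank.
    ranks = RANKS[entity]
    return LANGS[ranks.index(min(ranks))]


def rank_ratio(specific, generic):
    # Per-language ratio of the specific entity's rank to the generic one's.
    return {
        lang: s / g
        for lang, s, g in zip(LANGS, RANKS[specific], RANKS[generic])
    }


print(best_language("Morocco"))
ratios = rank_ratio("Bunk Bed", "Bed")
print({lang: round(r, 2) for lang, r in ratios.items()})
```

Every ratio is greater than 1, meaning ‘Bunk Bed’ ranks below ‘Bed’ in all five languages; in English the gap is roughly a factor of nine.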
5 Summary
We make the ranked list of top entities available as an open source data set at https://github.com/klout/opendata. To conclude, in this work we built a global ranking of entities across multiple languages by combining features from multiple knowledge bases. We also found that the combination of multiple features yields the best results. Future work in this direction includes adding new signals such as Wikipedia page view statistics and edit history.
References
[1] P. Bhargava, N. Spasojevic, and G. Hu. High-throughput and language-agnostic entity disambiguation and linking on user generated data. In Proceedings of the 26th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
[2] A. Gionis, T. Lappas, and E. Terzi. Estimating entity importance via counting set covers. In 18th Intl. Conf. on Knowledge Discovery and Data Mining, 2012.
[3] A. Rao, N. Spasojevic, Z. Li, and T. Dsouza. Klout score: Measuring influence across multiple social networks. In IEEE Intl. Conf. on Big Data, 2015.
[4] P. Saleiro and C. Soares. Learning from the news: Predicting entity popularity on twitter. In International Symposium on Intelligent Data Analysis, 2016.
[5] A.-M. Vercoustre, J. A. Thom, and J. Pehcevski. Entity ranking in wikipedia. In ACM symposium on Applied computing, 2008.
[6] H. Zaragoza, H. Rode, P. Mika, J. Atserias, M. Ciaramita, and G. Attardi. Ranking very many typed entities on wikipedia. In 16th ACM conference on Conference on Information and Knowledge Management, 2007.
