Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities

Yadollah Yaghoobzadeh; Hinrich Schütze

arXiv:1701.02025·cs.CL·January 18, 2017


TL;DR

This paper introduces a multi-level approach to representing entities using character, word, and entity embeddings, demonstrating that combining these levels enhances fine-grained entity typing accuracy.

Contribution

It proposes a novel multi-level entity representation framework that integrates character, word, and entity embeddings, and shows this improves over existing methods.

Findings

01 Joint multi-level representations outperform single-level baselines.

02 Adding entity descriptions further enhances representation quality.

03 Different learning methods excel at different representation levels.


Tables

Table 1. Table 1: Hyperparameters of different models. $w$ is the filter size. $n$ is the number of CNN feature maps for each filter size. $d_{c}$ is the character embedding size. $d_{h}$ is the LSTM hidden state size. $h_{\text{mlp}}$ is the number of hidden units in the MLP.

model	hyperparameters
CLR(FF)	$d_{c} = 15, h_{\text{mlp}} = 600$
CLR(LSTM)	$d_{c} = 70, d_{h} = 70, h_{\text{mlp}} = 300$
CLR(BiLSTM)	$d_{c} = 50, d_{h} = 50, h_{\text{mlp}} = 200$
CLR(CNN)	$d_{c} = 10, w = [1, \dots, 8], n = 100, h_{\text{mlp}} = 800$
CLR(NSL)	$h_{\text{mlp}} = 800$
BOW	$h_{\text{mlp}} = 200$
BOW+CLR(NSL)	$h_{\text{mlp}} = 300$
WWLR	$h_{\text{mlp}} = 400$
SWLR	$h_{\text{mlp}} = 400$
WWLR+CLR(CNN)	$w = [1, \dots, 7], d_{c} = 10, n = 50, h_{\text{mlp}} = 700$
SWLR+CLR(CNN)	$w = [1, \dots, 7], d_{c} = 10, n = 50, h_{\text{mlp}} = 700$
ELR(SKIP)	$h_{\text{mlp}} = 400$
ELR(SSKIP)	$h_{\text{mlp}} = 400$
ELR+CLR	$d_{c} = 10, w = [1, \dots, 7], n = 100, h_{\text{mlp}} = 700$
ELR+WWLR	$h_{\text{mlp}} = 600$
ELR+SWLR	$h_{\text{mlp}} = 600$
ELR+WWLR+CLR	$d_{c} = 10, w = [1, \dots, 7], n = 50, h_{\text{mlp}} = 700$
ELR+SWLR+CLR	$d_{c} = 10, w = [1, \dots, 7], n = 50, h_{\text{mlp}} = 700$
ELR+WWLR+CNN+TC	$d_{c} = 10, w = [1, \dots, 7], n = 50, h_{\text{mlp}} = 900$
ELR+SWLR+CNN+TC(MuLR)	$d_{c} = 10, w = [1, \dots, 7], n = 50, h_{\text{mlp}} = 900$
AVG-DES	$h_{\text{mlp}} = 400$
MuLR+AVG-DES	$d_{c} = 10, w = [1, \dots, 7], n = 50, h_{\text{mlp}} = 1000$

Table 2. Table 2: Accuracy (acc), micro (mic) and macro (mac) $F_{1}$ on test for all, head and tail entities.

		all entities			head entities			tail entities
		acc	mic	mac	acc	mic	mac	acc	mic	mac
1	MFT	.000	.041	.041	.000	.044	.044	.000	.038	.038
2	CLR(FORWARD)	.066	.379	.352	.067	.342	.369	.061	.374	.350
3	CLR(LSTM)	.121	.425	.396	.122	.433	.390	.116	.408	.391
4	CLR(BiLSTM)	.133	.440	.404	.129	.443	.394	.135	.428	.404
5	CLR(NSL)	.164	.484	.464	.157	.470	.443	.173	.483	.472
6	CLR(CNN)	.177	.494	.468	.171	.484	.450	.187	.489	.474
7	BOW	.113	.346	.379	.109	.323	.353	.120	.356	.396
8	WWLR(SKIP)	.214	.581	.531	.293	.660	.634	.173	.528	.478
9	WWLR(SSKIP)	.223	.584	.543	.306	.667	.642	.183	.533	.494
10	SWLR	.236	.590	.554	.301	.665	.632	.209	.551	.522
11	BOW+CLR(NSL)	.156	.487	.464	.157	.480	.452	.159	.485	.469
12	WWLR+CLR(CNN)	.257	.603	.568	.317	.668	.637	.235	.567	.538
13	SWLR+CLR(CNN)	.241	.594	.561	.295	.659	.628	.227	.560	.536
14	ELR(SKIP)	.488	.774	.741	.551	.834	.815	.337	.621	.560
15	ELR(SSKIP)	.515	.796	.763	.560	.839	.819	.394	.677	.619
16	AGG-FIGER	.320	.694	.660	.396	.762	.724	.220	.593	.568
17	ELR+CLR	.554	.816	.788	.580	.844	.825	.467	.733	.690
18	ELR+WWLR	.557	.819	.793	.582	.846	.827	.480	.749	.708
19	ELR+SWLR	.558	.820	.796	.584	.846	.829	.480	.751	.714
20	ELR+WWLR+CLR	.568	.823	.798	.590	.847	.829	.491	.755	.716
21	ELR+SWLR+CLR	.569	.824	.801	.590	.849	.831	.497	.760	.724
22	ELR+WWLR+CLR+TC	.572	.824	.801	.594	.849	.831	.499	.759	.722
23	ELR+SWLR+CLR+TC	.575	.826	.802	.597	.851	.831	.508	.762	.727

Table 3. Table 5: Micro average $F_{1}$ results of MuLR and the description-based model and their joint model.

entities:	all	head	tail
AVG-DES	.773	.791	.745
MuLR	.825	.846	.757
MuLR+AVG-DES	.873	.877	.852

Table 4. Table 6: Significance-test results for accuracy measure for all, head and tail entities. If the result for the model in a row is significantly larger than the result for the model in a column, then the value in the corresponding (row,column) is * and otherwise is 0.

All entities

	Models	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22	23
01	MFT	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
02	CLR(FORWARD)	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
03	CLR(LSTM)	*	*	0	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
04	CLR(BiLSTM)	*	*	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
05	CLR(CNN)	*	*	*	*	0	*	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
06	CLR(NSL)	*	*	*	*	0	0	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
07	BOW	*	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
08	WWLR(SkipG)	*	*	*	*	*	*	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
09	WWLR(SSkipG)	*	*	*	*	*	*	*	*	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
10	SWLR	*	*	*	*	*	*	*	*	*	0	*	0	0	0	0	0	0	0	0	0	0	0	0
11	BOW+CLR(NSL)	*	*	*	*	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
12	WWLR+CLR(CNN)	*	*	*	*	*	*	*	*	*	*	*	0	*	0	0	0	0	0	0	0	0	0	0
13	SWLR+CLR(CNN)	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0	0	0	0	0	0
14	ELR(SkipG)	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	*	0	0	0	0	0	0	0
15	ELR(SSkipG)	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	*	0	0	0	0	0	0	0
16	AGG-FIGER	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0	0	0	0
17	ELR+CLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
18	ELR+WWLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
19	ELR+SWLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
20	ELR+WWLR+CLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0
21	ELR+SWLR+CLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0
22	ELR+WWLR+CLR+TC	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0
23	ELR+SWLR+CLR+TC	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0

Head entities

	Models	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22	23
01	MFT	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
02	CLR(FORWARD)	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
03	CLR(LSTM)	*	*	0	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
04	CLR(BiLSTM)	*	*	0	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
05	CLR(CNN)	*	*	*	*	0	*	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
06	CLR(NSL)	*	*	*	*	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
07	BOW	*	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
08	WWLR(SkipG)	*	*	*	*	*	*	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
09	WWLR(SSkipG)	*	*	*	*	*	*	*	*	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
10	SWLR	*	*	*	*	*	*	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
11	BOW+CLR(NSL)	*	*	*	*	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
12	WWLR+CLR(CNN)	*	*	*	*	*	*	*	*	0	*	*	0	*	0	0	0	0	0	0	0	0	0	0
13	SWLR+CLR(CNN)	*	*	*	*	*	*	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
14	ELR(SkipG)	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	*	0	0	0	0	0	0	0
15	ELR(SSkipG)	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	*	0	0	0	0	0	0	0
16	AGG-FIGER	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0	0	0	0
17	ELR+CLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
18	ELR+WWLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
19	ELR+SWLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
20	ELR+WWLR+CLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
21	ELR+SWLR+CLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
22	ELR+WWLR+CLR+TC	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0
23	ELR+SWLR+CLR+TC	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0

Tail entities

	Models	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22	23
01	MFT	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
02	CLR(FORWARD)	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
03	CLR(LSTM)	*	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
04	CLR(BiLSTM)	*	*	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
05	CLR(CNN)	*	*	*	*	0	*	*	*	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
06	CLR(NSL)	*	*	*	*	0	0	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
07	BOW	*	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
08	WWLR(SkipG)	*	*	*	*	0	0	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
09	WWLR(SSkipG)	*	*	*	*	0	0	*	0	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0
10	SWLR	*	*	*	*	*	*	*	*	*	0	*	0	0	0	0	0	0	0	0	0	0	0	0
11	BOW+CLR(NSL)	*	*	*	*	0	0	*	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
12	WWLR+CLR(CNN)	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	*	0	0	0	0	0	0	0
13	SWLR+CLR(CNN)	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0	0	0	0	0	0
14	ELR(SkipG)	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	*	0	0	0	0	0	0	0
15	ELR(SSkipG)	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	*	0	0	0	0	0	0	0
16	AGG-FIGER	*	*	*	*	*	*	*	*	*	0	*	0	0	0	0	0	0	0	0	0	0	0	0
17	ELR+CLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
18	ELR+WWLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
19	ELR+SWLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0	0
20	ELR+WWLR+CLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0	0	0
21	ELR+SWLR+CLR	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0
22	ELR+WWLR+CLR+TC	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0	0
23	ELR+SWLR+CLR+TC	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	*	0	0	0

Equations

$\big[P(t_{1}|e)\ldots P(t_{T}|e)\big]=\sigma\Big(\textbf{W}_{\text{out}}\,f\big(\textbf{W}_{\text{in}}\vec{v}(e)\big)\Big)$

$\sum_{t}-\Big(m_{t}\log p_{t}+(1-m_{t})\log(1-p_{t})\Big)$


Taxonomy

Methods: fastText

Full text

Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities

Yadollah Yaghoobzadeh and Hinrich Schütze

Center for Information and Language Processing

LMU Munich, Germany


Abstract

Entities are essential elements of natural language. In this paper, we present methods for learning multi-level representations of entities on three complementary levels: character (character patterns in entity names extracted, e.g., by neural networks), word (embeddings of words in entity names) and entity (entity embeddings). We investigate state-of-the-art learning methods on each level and find large differences, e.g., for deep learning models, traditional ngram features and the subword model of fasttext [Bojanowski et al., 2016] on the character level; for word2vec [Mikolov et al., 2013] on the word level; and for the order-aware model wang2vec [Ling et al., 2015a] on the entity level.

We confirm experimentally that each level of representation contributes complementary information and a joint representation of all three levels improves the existing embedding based baseline for fine-grained entity typing by a large margin. Additionally, we show that adding information from entity descriptions further improves multi-level representations of entities.

1 Introduction

Knowledge about entities is essential for understanding human language. This knowledge can be attributional (e.g., canFly, isEdible), type-based (e.g., isFood, isPolitician, isDisease) or relational (e.g., marriedTo, bornIn). Knowledge bases (KBs) are designed to store this information in a structured way, so that it can be queried easily. Examples of such KBs are Freebase [Bollacker et al., 2008], Wikipedia, Google Knowledge Graph and YAGO [Suchanek et al., 2007]. Text resources such as news, user forums, textbooks or any other data in the form of text are important sources for automatically updating and completing entity knowledge. Therefore, information extraction methods have been introduced to extract knowledge about entities from text. In this paper, we focus on the extraction of entity types, i.e., assigning types to – or typing – entities. Type information can help extraction of relations by applying constraints on relation arguments.

We address a problem setting in which the following are given: a KB with a set of entities $E$, a set of types $T$ and a membership function $m:E\times T\mapsto\{0,1\}$ such that $m(e,t)=1$ iff entity $e$ has type $t$; and a large corpus $C$ in which mentions of $E$ are annotated. In this setting, we address the task of fine-grained entity typing: we want to learn a probability function $S(e,t)$ for a pair of entity $e$ and type $t$ and based on $S(e,t)$ infer whether $m(e,t)=1$ holds, i.e., whether entity $e$ is a member of type $t$.

We address this problem by learning a multi-level representation for an entity that contains the information necessary for typing it. One important source is the contexts in which the entity is used. We can take the standard method of learning embeddings for words and extend it to learning embeddings for entities. This requires the use of an entity linker and can be implemented by replacing all occurrences of the entity by a unique token. We refer to entity embeddings as entity-level representations. Previously, entity embeddings have been learned mostly using bag-of-word models like word2vec (e.g., by ?) and ?)). We show below that order information is critical for high-quality entity embeddings.

Entity-level representations are often uninformative for rare entities, so that using only entity embeddings is likely to produce poor results. In this paper, we use entity names as a source of information that is complementary to entity embeddings. We define an entity name as a noun phrase that is used to refer to an entity. We learn character and word level representations of entity names.

For the character-level representation, we adopt different character-level neural network architectures. Our intuition is that there is sub/cross word information, e.g., orthographic patterns, that is helpful to get better entity representations, especially for rare entities. A simple example is that a three-token sequence containing an initial like “P.” surrounded by two capitalized words (“Rolph P. Kugl”) is likely to refer to a person.

We compute the word-level representation as the sum of the embeddings of the words that make up the entity name. The sum of the embeddings accumulates evidence for a type/property over all constituents, e.g., a name containing “stadium”, “lake” or “cemetery” is likely to refer to a location. In this paper, we compute our word level representation with two types of word embeddings: (i) using only contextual information of words in the corpus, e.g., by word2vec [Mikolov et al., 2013] and (ii) using subword as well as contextual information of words, e.g., by Facebook’s recently released fasttext [Bojanowski et al., 2016].

In this paper, we integrate character-level and word-level with entity-level representations to improve the results of previous work on fine-grained typing of KB entities. We also show how descriptions of entities in a KB can be a complementary source of information to our multi-level representation to improve the results of entity typing, especially for rare entities.

Our main contributions in this paper are:

• We propose new methods for learning entity representations on three levels: character-level, word-level and entity-level.

• We show that these levels are complementary and a joint model that uses all three levels improves the state of the art on the task of fine-grained entity typing by a large margin.

• We experimentally show that an order-dependent embedding is more informative than its bag-of-word counterpart for entity representation.

We release our dataset and source code: cistern.cis.lmu.de/figment2/.

2 Related Work

Entity representation. Two main sources of information used for learning entity representation are: (i) links and descriptions in KB, (ii) name and contexts in corpora. We focus on name and contexts in corpora, but we also include (Wikipedia) descriptions. We represent entities on three levels: entity, word and character. Our entity-level representation is similar to work on relation extraction [Wang et al., 2014, Wang and Li, 2016], entity linking [Yamada et al., 2016, Fang et al., 2016], and entity typing [Yaghoobzadeh and Schütze, 2015]. Our word-level representation with distributional word embeddings is similarly used to represent entities for entity linking [Sun et al., 2015] and relation extraction [Socher et al., 2013, Wang et al., 2014]. Novel entity representation methods we introduce in this paper are representation based on fasttext [Bojanowski et al., 2016] subword embeddings, several character-level representations, “order-aware” entity-level embeddings and the combination of several different representations into one multi-level representation.

Character-subword level neural networks. Character-level convolutional neural networks (CNNs) are applied by ?) to part of speech (POS) tagging, by ?), ?), and ?) to named entity recognition (NER), by ?) and ?) to sentiment analysis and text categorization, and by ?) to language modeling (LM). Character-level LSTM is applied by ?) to LM and POS tagging, by ?) to NER, by ?) to parsing morphologically rich languages, and by ?) to learning word embeddings. ?) learn word embeddings by representing words with the average of their character ngrams (subwords) embeddings. Similarly, ?) extends word2vec for Chinese with joint modeling with characters.

Fine-grained entity typing. Our task is to infer fine-grained types of KB entities. KB completion is an application of this task. ?)’s FIGMENT system addresses this task with only contextual information; they do not use character-level and word-level features of entity names. ?) and ?) also address a similar task, but they rely on entity descriptions in KBs, which in many settings are not available. The problem of fine-grained mention typing (FGMT) [Yosef et al., 2012, Ling and Weld, 2012, Yogatama et al., 2015, Del Corro et al., 2015, Shimaoka et al., 2016, Ren et al., 2016] is related to our task. FGMT classifies single mentions of named entities to their context-dependent types, whereas we attempt to identify all types of a KB entity from the aggregation of all its mentions. FGMT can still be evaluated in our task by aggregating the mention-level decisions, but as we will show in our experiments for one system, i.e., FIGER [Ling and Weld, 2012], our entity-embedding-based models are better at entity typing.

3 Fine-grained entity typing

Given (i) a KB with a set of entities $E$, (ii) a set of types $T$, and (iii) a large corpus $C$ in which mentions of $E$ are linked, we address the task of fine-grained entity typing [Yaghoobzadeh and Schütze, 2015]: predict whether entity $e$ is a member of type $t$ or not. To do so, we use a set of training examples to learn $P(t|e)$: the probability that entity $e$ has type $t$. These probabilities can be used to assign new types to entities covered in the KB as well as to type unknown entities.

We learn $P(t|e)$ with a general architecture; see Figure 1. The output layer has size $|T|$. Unit $t$ of this layer outputs the probability for type $t$. “Entity Representation” ($\vec{v}(e)$) is the vector representation of entity $e$ – we will describe in detail in the rest of this section what forms $\vec{v}(e)$ takes. We model $P(t|e)$ as a multi-label classification, and train a multilayer perceptron (MLP) with one hidden layer:

$\big[P(t_{1}|e)\ldots P(t_{T}|e)\big]=\sigma\Big(\textbf{W}_{\text{out}}\,f\big(\textbf{W}_{\text{in}}\vec{v}(e)\big)\Big)$

where $\textbf{W}_{\text{in}}\in\mathbb{R}^{h\times d}$ is the weight matrix from $\vec{v}(e)\in\mathbb{R}^{d}$ to the hidden layer with size $h$, $f$ is the rectifier function, $\textbf{W}_{\text{out}}\in\mathbb{R}^{|T|\times h}$ is the weight matrix from the hidden layer to the output layer of size $|T|$, and $\sigma$ is the sigmoid function. Our objective is binary cross entropy summed over types:

$\sum_{t}-\Big(m_{t}\log p_{t}+(1-m_{t})\log(1-p_{t})\Big)$

where $m_{t}$ is the truth and $p_{t}$ the prediction.
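The one-hidden-layer MLP and the summed binary cross-entropy objective described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the absence of bias terms, and the small `eps` added for numerical safety are choices made here, not details confirmed by the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(W: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix W (one list per row) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def type_probabilities(W_in: Matrix, W_out: Matrix, v_e: List[float]) -> List[float]:
    """Return [P(t_1|e), ..., P(t_T|e)] = sigmoid(W_out * relu(W_in * v(e)))."""
    hidden = [max(0.0, z) for z in matvec(W_in, v_e)]  # rectifier f
    return [sigmoid(z) for z in matvec(W_out, hidden)]


def bce_loss(m: List[int], p: List[float], eps: float = 1e-12) -> float:
    """Binary cross entropy summed over types; eps is an assumption to avoid log(0)."""
    return sum(
        -(m_t * math.log(p_t + eps) + (1 - m_t) * math.log(1.0 - p_t + eps))
        for m_t, p_t in zip(m, p)
    )
```

Because each output unit has its own sigmoid, an entity can receive several types at once, which is what makes the setup multi-label rather than multi-class.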

The key difficulty when trying to compute $P(t|e)$ is in learning a good representation for entity $e$ . We make use of contexts and name of $e$ to represent its feature vector on the three levels of entity, word and character.

3.1 Entity-level representation

Distributional representations or embeddings are commonly used for words. The underlying hypothesis is that words with similar meanings tend to occur in similar contexts [Harris, 1954] and therefore cooccur with similar context words. We can extend the distributional hypothesis to entities (cf. ?), ?)): entities with similar meanings tend to have similar contexts. Thus, we can learn a $d$ dimensional embedding $\vec{v}(e)$ of entity $e$ from a corpus in which all mentions of the entity have been replaced by a special identifier. We refer to these entity vectors as the entity level representation (ELR).

In previous work, order information of context words (relative position of words in the contexts) was generally ignored and objectives similar to the SkipGram (henceforth: SKIP) model were used to learn $\vec{v}(e)$ . However, the bag-of-word context is difficult to distinguish for pairs of types like (restaurant,food) and (author,book). This suggests that using order aware embedding models is important for entities. Therefore, we apply ?)’s extended version of SKIP, Structured SKIP (SSKIP). It incorporates the order of context words into the objective. We compare it with SKIP embeddings in our experiments.
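To make the difference between bag-of-word and order-aware contexts concrete, the sketch below extracts the context of a target token in both styles. It only illustrates the training signal, not the embedding models themselves; the function names and the entity identifiers `/m/author` and `/m/book` are hypothetical.

```python
from typing import List, Tuple


def skip_contexts(tokens: List[str], i: int, window: int) -> List[str]:
    """Bag-of-word style context: neighbouring words, positions discarded."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def sskip_contexts(tokens: List[str], i: int, window: int) -> List[Tuple[str, int]]:
    """Order-aware context: each neighbour is paired with its relative position."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [(tokens[j], j - i) for j in range(lo, hi) if j != i]


# Hypothetical sentence in which mentions were replaced by entity identifiers.
sentence = ["/m/author", "wrote", "/m/book"]
author_bag = skip_contexts(sentence, 0, 1)       # ['wrote']
book_bag = skip_contexts(sentence, 2, 1)         # ['wrote']
author_ordered = sskip_contexts(sentence, 0, 1)  # [('wrote', 1)]
book_ordered = sskip_contexts(sentence, 2, 1)    # [('wrote', -1)]
```

In this toy sentence the author and the book have identical bag-of-word contexts, while the position-tagged contexts differ, which is the kind of distinction an order-aware objective can exploit.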

3.2 Word-level representation

Words inside entity names are important sources of information for typing entities. We define the word-level representation (WLR) as the average of the embeddings of the words that the entity name contains: $\vec{v}(e)=\frac{1}{n}\sum_{i=1}^{n}\vec{v}(w_{i})$, where $\vec{v}(w_{i})$ is the embedding of the $i^{\text{th}}$ word of an entity name of length $n$. We opt for simple averaging since entity names often consist of a small number of words with clear semantics. Thus, averaging is a promising way of combining the information that each word contributes.

The word embedding, $\vec{v}(w)$, itself can be learned from models with different granularity levels. Embedding models that consider words as atomic units in the corpus, e.g., SKIP and SSKIP, are word-level. On the other hand, embedding models that represent words with their character ngrams, e.g., fasttext [Bojanowski et al., 2016], are subword-level. Based on this, we consider and evaluate word-level WLR (WWLR) and subword-level WLR (SWLR) in this paper. (Subword models have properties of both character-level models (subwords are character ngrams) and of word-level models (they do not cross boundaries between words). They probably could be put in either category, but in our context fit the word-level category better because we see the granularity level with respect to the entities and not words.)
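A minimal sketch of the averaging step, assuming word embeddings are available as a dictionary from word to vector. Whitespace tokenization, skipping out-of-vocabulary words, and returning a zero vector when no word is known are assumptions made here for illustration.

```python
from typing import Dict, List


def word_level_representation(
    name: str, embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the embeddings of the words in an entity name."""
    vectors = [embeddings[w] for w in name.split() if w in embeddings]
    if not vectors:
        return [0.0] * dim  # assumption: zero vector if no word is known
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(dim)]


# Toy two-dimensional embeddings (hypothetical values).
toy_embeddings = {"lake": [1.0, 0.0], "placid": [0.0, 1.0]}
wlr = word_level_representation("lake placid", toy_embeddings, dim=2)
```

The same function covers both WWLR and SWLR; only the source of `embeddings` changes.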

3.3 Character-level representation

For computing the character level representation (CLR), we design models that try to type an entity based on the sequence of characters of its name. Our hypothesis is that names of entities of a specific type often have similar character patterns. Entities of type ethnicity often end in “ish” and “ian”, e.g., “Spanish” and “Russian”. Entities of type medicine often end in “en”: “Lipofen”, “acetaminophen”. Also, some types tend to have specific cross-word shapes in their entities, e.g., person names usually consist of two words, or music names are usually long, containing several words.

The first layer of the character-level models is a lookup table that maps each character to an embedding of size $d_{c}$. These embeddings capture similarities between characters, e.g., similarity in type of phoneme encoded (consonant/vowel) or similarity in case (lower/upper). The output of the lookup layer for an entity name is a matrix $C\in\mathbb{R}^{l\times d_{c}}$ where $l$ is the maximum length of a name and all names are padded to length $l$. This length $l$ includes special start/end characters that bracket the entity name.

We experiment with four architectures to produce character-level representations in this paper: FORWARD (direct forwarding of character embeddings), CNNs, LSTMs and BiLSTMs. The output of each architecture then takes the place of the entity representation $\vec{v}(e)$ in Figure 1.

FORWARD simply concatenates all rows of matrix $C$; thus, $\vec{v}(e)\in\mathbb{R}^{d_{c}\cdot l}$.
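The lookup layer and the FORWARD architecture can be sketched as follows. The special symbols `<s>`, `</s>` and `<pad>`, the fallback to the padding embedding for unseen characters, and truncation of over-long names are assumptions for illustration; the paper only states that names are bracketed by start/end characters and padded to length $l$.

```python
from typing import Dict, List

START, END, PAD = "<s>", "</s>", "<pad>"  # hypothetical special symbols


def char_matrix(name: str, table: Dict[str, List[float]], l: int) -> List[List[float]]:
    """Build the l x d_c matrix C: one character embedding per row."""
    symbols = [START] + list(name) + [END]
    symbols = symbols[:l] + [PAD] * max(0, l - len(symbols))
    # Assumption: unseen characters fall back to the padding embedding.
    return [table.get(s, table[PAD]) for s in symbols]


def forward_representation(C: List[List[float]]) -> List[float]:
    """FORWARD: concatenate all rows of C into a vector of size d_c * l."""
    return [x for row in C for x in row]


toy_table = {START: [1.0, 0.0], END: [0.0, 1.0], PAD: [0.0, 0.0], "a": [0.5, 0.5]}
C = char_matrix("a", toy_table, l=4)
v = forward_representation(C)
```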

The CNN uses $k$ filters of different window widths $w$ to narrowly convolve $C$. For each filter $H\in\mathbb{R}^{d_{c}\times w}$, the result of the convolution of $H$ over matrix $C$ is feature map $f\in\mathbb{R}^{l-w+1}$:

$f[i]=\mbox{rectifier}(C_{[:,i:i+w-1]}\odot H+b)$

where rectifier is the activation function, $b$ is the bias, $C_{[:,i:i+w-1]}$ are the columns $i$ to $i+w-1$ of $C$, $1\leq w\leq 10$ are the window widths we consider and $\odot$ is the sum of element-wise multiplication. Max pooling then gives us one feature for each filter. The concatenation of all these features is our representation: $\vec{v}(e)\in\mathbb{R}^{k}$. An example CNN architecture is shown in Figure 2.
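A plain-Python sketch of the narrow convolution followed by max pooling. For convenience the character matrix is stored here as a list of $l$ character vectors and each filter as a list of $w$ vectors of size $d_{c}$, i.e., transposed relative to the column notation in the formula; the function names are hypothetical and each name is assumed to be at least as long as the widest filter.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def narrow_convolution(C: Matrix, H: Matrix, b: float) -> List[float]:
    """Feature map f of length l - w + 1 for one filter H of width w."""
    l, w = len(C), len(H)
    feature_map = []
    for i in range(l - w + 1):
        # Sum of element-wise products between the window and the filter.
        total = b
        for j in range(w):
            total += sum(c * h for c, h in zip(C[i + j], H[j]))
        feature_map.append(max(0.0, total))  # rectifier
    return feature_map


def cnn_representation(C: Matrix, filters: List[Tuple[Matrix, float]]) -> List[float]:
    """Max-pool each feature map and concatenate: one feature per filter."""
    return [max(narrow_convolution(C, H, b)) for H, b in filters]


# Toy input with l = 3 characters and d_c = 1.
toy_C = [[1.0], [2.0], [3.0]]
toy_filters = [([[1.0], [1.0]], 0.0), ([[-1.0]], 0.0)]
fmap = narrow_convolution(toy_C, toy_filters[0][0], toy_filters[0][1])
rep = cnn_representation(toy_C, toy_filters)
```

Max pooling makes the representation independent of where in the name a pattern occurs, so a suffix detector and a prefix detector are both simply filters whose strongest activation is kept.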

The input to the LSTM is the character sequence in matrix $C$, i.e., $x_{1},\dots,x_{l}\in\mathbb{R}^{d_{c}}$. It generates the state sequence $h_{1},\dots,h_{l+1}$ and the output is the last state $\vec{v}(e)\in\mathbb{R}^{d_{h}}$. (We use Blocks [van Merriënboer et al., 2015].)

The BiLSTM consists of two LSTMs, one going forward, one going backward. The first state of the backward LSTM is initialized as $h_{l+1}$, the last state of the forward LSTM. The BiLSTM entity representation is the concatenation of last states of forward and backward LSTMs, i.e., $\vec{v}(e)\in\mathbb{R}^{2\cdot d_{h}}$.

3.4 Multi-level representations

Our different levels of representations can give complementary information about entities.

WLR and CLR. Both WLR models, SWLR and WWLR, do not have access to the cross-word character ngrams of entity names while CLR models do. Also, CLR is task specific by training on the entity typing dataset while WLR is generic. On the other hand, WWLR and SWLR models have access to information that CLR ignores: the tokenization of entity names into words and embeddings of these words. It is clear that words are particularly important character sequences since they often correspond to linguistic units with clearly identifiable semantics – which is not true for most character sequences. For many entities, the words they contain are a better basis for typing than the character sequence. For example, even if “nectarine” and “compote” did not occur in any names in the training corpus, we can still learn good word embeddings from their non-entity occurrences. This then allows us to correctly type the entity “Aunt Mary’s Nectarine Compote” as food based on the sum of the word embeddings.

WLR/CLR and ELR. Representations from entity names, i.e., WLR and CLR, by themselves are limited because many classes of names can be used for different types of entities; e.g., person names do not contain hints as to whether they are referring to a politician or athlete. In contrast, the ELR embedding is based on an entity’s contexts, which are often informative for each entity and can distinguish politicians from athletes. On the other hand, not all entities have sufficiently many informative contexts in the corpus. For these entities, their name can be a complementary source of information and character/word level representations can increase typing accuracy.

Thus, we introduce joint models that use combinations of the three levels. Each multi-level model concatenates several levels. We train the constituent embeddings as follows. WLR and ELR are computed as described above and are not changed during training. CLR – produced by one of the character-level networks described above – is initialized randomly and then tuned during training. Thus, it can focus on complementary information related to the task that is not already present in other levels. The schematic diagram of our multi-level representation is shown in Figure 3.
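The concatenation itself is simple; the sketch below mainly shows how the dimensions add up. The sizes used in the example are an assumption for illustration: 200 dimensions each for ELR and WLR, and a CNN-based CLR with 7 window widths and 50 feature maps per width, giving 350 features.

```python
from typing import List, Sequence


def multi_level_representation(
    elr: Sequence[float], wlr: Sequence[float], clr: Sequence[float]
) -> List[float]:
    """Concatenate entity-level, word-level and character-level vectors."""
    return list(elr) + list(wlr) + list(clr)


# Illustrative sizes (assumed): 200-dim ELR and WLR, 7 widths x 50 maps for CLR.
elr_vec = [0.0] * 200
wlr_vec = [0.0] * 200
clr_vec = [0.0] * (7 * 50)
joint = multi_level_representation(elr_vec, wlr_vec, clr_vec)
joint_dim = len(joint)  # 200 + 200 + 350 = 750
```

In a trainable implementation only the parameters that produce `clr` would receive gradient updates, matching the description that WLR and ELR stay fixed while CLR is tuned.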

4 Experimental setup and results

4.1 Setup

Entity datasets and corpus. We address the task of fine-grained entity typing and use ?)’s FIGMENT dataset (cistern.cis.lmu.de/figment/) for evaluation. The FIGMENT corpus is part of a version of ClueWeb in which Freebase entities are annotated using FACC1 [URL, 2016b, Gabrilovich et al., 2013]. The FIGMENT entity datasets contain 200,000 Freebase entities that were mapped to 102 FIGER types [Ling and Weld, 2012]. We use the same train (50%), dev (20%) and test (30%) partitions as ?) and extract the names from mentions of dataset entities in the corpus. We take the most frequent name for dev and test entities and the three most frequent names for train (each one tagged with entity types).

Adding parent types to refine entity dataset. FIGMENT ignores that FIGER is a proper hierarchy of types; e.g., while hospital is a subtype of building according to FIGER, there are entities in FIGMENT that are hospitals, but not buildings (see github.com/xiaoling/figer for FIGER). Therefore, we modified the FIGMENT dataset by adding for each assigned type (e.g., hospital) its parents (e.g., building). This makes FIGMENT more consistent and eliminates spurious false negatives (building in the example).
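The parent-type refinement amounts to closing each entity's type set under the parent relation. A minimal sketch, assuming the hierarchy is given as a child-to-parent dictionary; the mapping and function name below are hypothetical and use the plain labels from the example rather than actual FIGER type identifiers.

```python
from typing import Dict, Iterable, Set


def add_parent_types(types: Iterable[str], parent_of: Dict[str, str]) -> Set[str]:
    """Return the given types plus all their ancestors in the hierarchy."""
    closed = set(types)
    frontier = list(closed)
    while frontier:
        parent = parent_of.get(frontier.pop())
        if parent is not None and parent not in closed:
            closed.add(parent)
            frontier.append(parent)  # keep climbing for deeper hierarchies
    return closed


toy_hierarchy = {"hospital": "building"}  # hypothetical fragment
refined = add_parent_types({"hospital"}, toy_hierarchy)
```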

We now describe our baselines: (i) BOW & NSL: hand-crafted features, (ii) FIGMENT [Yaghoobzadeh and Schütze, 2015] and (iii) adapted version of FIGER [Ling and Weld, 2012].

We implement the following two feature sets from the literature as a hand-crafted baseline for our character and word level models. (i) BOW: individual words of entity name (both as-is and lowercased); (ii) NSL (ngram-shape-length): shape and length of the entity name (cf. ?)), character $n$-grams, $1\leq n\leq n_{\text{max}}$, $n_{\text{max}}=5$ (we also tried $n_{\text{max}}=7$, but results were worse on dev) and normalized character $n$-grams: lowercased, digits replaced by “7”, punctuation replaced by “.”. These features are represented as a sparse binary vector $\vec{v}(e)$ that is input to the architecture in Figure 1.
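The NSL feature set can be sketched as a set of string-valued features. The normalization rules follow the description above; the concrete shape encoding (uppercase to `A`, lowercase to `a`, digit to `0`, everything else kept) and the feature-name prefixes are assumptions, since the text does not spell them out.

```python
import string
from typing import Set


def normalize(name: str) -> str:
    """Lowercase, replace digits by '7' and punctuation by '.'."""
    out = []
    for ch in name.lower():
        if ch.isdigit():
            out.append("7")
        elif ch in string.punctuation:
            out.append(".")
        else:
            out.append(ch)
    return "".join(out)


def char_ngrams(text: str, n_max: int = 5) -> Set[str]:
    """All character n-grams with 1 <= n <= n_max."""
    return {text[i:i + n] for n in range(1, n_max + 1) for i in range(len(text) - n + 1)}


def shape(name: str) -> str:
    """Assumed shape encoding: A for uppercase, a for lowercase, 0 for digits."""
    return "".join(
        "A" if c.isupper() else "a" if c.islower() else "0" if c.isdigit() else c
        for c in name
    )


def nsl_features(name: str, n_max: int = 5) -> Set[str]:
    """Active (non-zero) entries of the sparse binary NSL vector."""
    feats = {"raw:" + g for g in char_ngrams(name, n_max)}
    feats |= {"norm:" + g for g in char_ngrams(normalize(name), n_max)}
    feats.add("shape:" + shape(name))
    feats.add("len:" + str(len(name)))
    return feats
```

To feed these into the MLP, each distinct feature string seen in training would be assigned an index in the sparse binary input vector.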

FIGMENT is the model for entity typing presented by ?). The authors only use entity-level representations for entities trained by SkipGram, so the FIGMENT baseline corresponds to the entity-level result shown as ELR(SKIP) in the tables.

The third baseline is using an existing mention-level entity typing system, FIGER [Ling and Weld, 2012]. FIGER uses a wide variety of features on different levels (including parsing-based features) from contexts of entity mentions as well as the mentions themselves and returns a score for each mention-type instance in the corpus. We provide the ClueWeb/FACC1 segmentation of entities, so FIGER does not need to recognize entities. (Mention typing is separated from recognition in the FIGER model, so it can use our segmentation of entities.) We use the trained model provided by the authors and normalize FIGER scores using softmax to make them comparable for aggregation. We experimented with different aggregation functions (including maximum and k-largest-scores for a type), but we use the average of scores since it gave us the best result on dev. We call this baseline AGG-FIGER.
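The aggregation step can be sketched as follows: normalize the scores of each mention with softmax, then average per type over all mentions of the entity. That the softmax is taken over the types of a single mention is an assumption here, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one mention's type scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_mentions(mention_scores: List[List[float]]) -> List[float]:
    """Average the softmax-normalized scores of all mentions, per type."""
    probs = [softmax(scores) for scores in mention_scores]
    n = len(probs)
    return [sum(p[t] for p in probs) / n for t in range(len(probs[0]))]


# Two mentions of one entity, three types (toy scores).
entity_scores = aggregate_mentions([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
```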

Distributional embeddings. For WWLR and ELR, we use the SkipGram model in word2vec and the SSkip model in wang2vec [Ling et al., 2015a] to learn embeddings for words, entities and types. To obtain embeddings for all three in the same space, we process ClueWeb/FACC1 as follows. For each sentence $s$, we add three copies: $s$ itself, a copy of $s$ in which each entity is replaced with its Freebase identifier (MID) and a copy in which each entity (not test entities though) is replaced with an ID indicating its notable type. The resulting corpus contains around 4 billion tokens and 1.5 billion types.
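The three-copy construction can be sketched as below. The input format (a list of `(surface, mid)` pairs with `mid` set to `None` for ordinary words), the identifiers, and the choice to keep the surface words of test entities in the type copy are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# One annotated segment: (surface string, MID or None for an ordinary word).
Segment = Tuple[str, Optional[str]]


def three_copies(sentence: List[Segment],
                 notable_type: Dict[str, str],
                 test_mids: Set[str]) -> List[List[str]]:
    """Return the word copy, the MID copy and the notable-type copy of a sentence."""
    words: List[str] = []
    mids: List[str] = []
    types: List[str] = []
    for surface, mid in sentence:
        tokens = surface.split()
        words.extend(tokens)
        if mid is None:
            mids.extend(tokens)
            types.extend(tokens)
            continue
        mids.append(mid)
        if mid in test_mids or mid not in notable_type:
            # Assumption: test entities keep their surface words in the type copy.
            types.extend(tokens)
        else:
            types.append(notable_type[mid])
    return [words, mids, types]


# Hypothetical identifiers, for illustration only.
example = three_copies(
    [("Obama", "/m/e1"), ("visited", None), ("Berlin", "/m/e2")],
    notable_type={"/m/e1": "/person", "/m/e2": "/location/city"},
    test_mids={"/m/e2"},
)
```

Training one embedding model on the concatenation of all three copies places words, entities and types in the same vector space.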

We run SKIP and SSkip with the same setup (200 dimensions, 10 negative samples, window size 5, word frequency threshold of 100; the threshold does not apply to MIDs) on this corpus to learn embeddings for words, entities and FIGER types. Having entities and types in the same vector space, we can add another feature vector $\vec{v}(e)\in\mathbb{R}^{|T|}$ (referred to as TC below): for each entity, we compute cosine similarity of its entity vector with all type vectors.
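As a small illustration of the TC feature vector, the sketch below computes the cosine similarity of one entity vector with every type vector. The function names are our own, and returning 0.0 for a zero vector is an assumption.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector has zero norm (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def type_cosine_features(entity_vec: Sequence[float],
                         type_vecs: Sequence[Sequence[float]]) -> List[float]:
    """TC vector of length |T|: one cosine per type, in the order of type_vecs."""
    return [cosine(entity_vec, t) for t in type_vecs]
```

The resulting vector has one entry per type and can be concatenated with the other representations.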

For SWLR, we use fasttext (github.com/facebookresearch/fastText) to learn word embeddings from the ClueWeb/FACC1 corpus. We use settings similar to those of our WWLR SKIP and SSkip embeddings and keep the defaults of other hyperparameters. Since the trained model of fasttext is applicable to new words, we apply the model to get embeddings for the filtered rare words as well.

Our hyperparameter values are given in Table 1. The values are optimized on dev. We use AdaGrad and minibatch training. For each experiment, we select the best model on dev.

We use these evaluation measures: (i) accuracy: an entity is correct if all its types and no incorrect types are assigned to it; (ii) micro average $F_{1}$ : $F_{1}$ of all type-entity assignment decisions; (iii) entity macro average $F_{1}$ : $F_{1}$ of types assigned to an entity, averaged over entities; (iv) type macro average $F_{1}$ : $F_{1}$ of entities assigned to a type, averaged over types.
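The four measures can be written down compactly. The sketch below follows the definitions above literally. Treating $F_{1}$ as 0 when an entity or type has neither gold nor predicted assignments is our assumption, as are the function names and the dictionary-of-sets input format.

```python
from typing import Dict, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; 0.0 when there is nothing to count (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def counts(gold: Set[str], pred: Set[str]) -> Tuple[int, int, int]:
    """True positives, false positives, false negatives of two sets."""
    return len(gold & pred), len(pred - gold), len(gold - pred)


def evaluate(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]) -> Dict[str, float]:
    """gold/pred map each entity to its set of types; entities are taken from gold."""
    entities = sorted(gold)
    predicted = {e: pred.get(e, set()) for e in entities}
    types = sorted(set().union(*gold.values(), *predicted.values()))

    accuracy = sum(gold[e] == predicted[e] for e in entities) / len(entities)

    per_entity = [counts(gold[e], predicted[e]) for e in entities]
    micro = f1(sum(c[0] for c in per_entity),
               sum(c[1] for c in per_entity),
               sum(c[2] for c in per_entity))
    entity_macro = sum(f1(*c) for c in per_entity) / len(entities)

    type_scores = []
    for t in types:
        gold_e = {e for e in entities if t in gold[e]}
        pred_e = {e for e in entities if t in predicted[e]}
        type_scores.append(f1(*counts(gold_e, pred_e)))
    type_macro = sum(type_scores) / len(types) if types else 0.0

    return {"accuracy": accuracy, "micro_f1": micro,
            "entity_macro_f1": entity_macro, "type_macro_f1": type_macro}
```

Accuracy is the strictest of the four, since a single missing or spurious type makes the whole entity count as wrong.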

The assignment decision is based on thresholding the probability function $P(t|e)$ . For each model and type, we select the threshold that maximizes $F_{1}$ of entities assigned to the type on dev.
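Per-type threshold selection can be sketched as follows. Using the observed dev probabilities as candidate thresholds and breaking ties in favor of the lowest threshold are assumptions. The function name and the default of 0.5 for an empty dev set are also our own.

```python
from typing import List


def best_threshold(probs: List[float], gold: List[bool]) -> float:
    """Threshold on P(t|e) that maximizes F1 for one type on the dev entities."""
    best_t, best_f = 0.5, -1.0  # 0.5 is only returned if probs is empty
    for t in sorted(set(probs)):
        tp = sum(1 for p, g in zip(probs, gold) if p >= t and g)
        fp = sum(1 for p, g in zip(probs, gold) if p >= t and not g)
        fn = sum(1 for p, g in zip(probs, gold) if p < t and g)
        denom = 2 * tp + fp + fn
        score = 2 * tp / denom if denom else 0.0
        if score > best_f:
            best_t, best_f = t, score
    return best_t


# Dev probabilities of one type for four entities and their gold labels.
threshold = best_threshold([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
```

At test time, a type is assigned to an entity when its probability reaches the threshold selected for that type and model.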

4.2 Results

Table 4 gives results on the test entities for all (about 60,000 entities), head (frequency $>$ 100; about 12,200) and tail (frequency $<$ 5; about 10,000). MFT (line 1) is the most frequent type baseline that ranks types according to their frequency in the train entities. Each level of representation is separated with dashed lines, and – unless noted otherwise – the best of each level is joined in multi level representations. (For the accuracy measure: in the following ordered list of sets, $A<B$ means that all members (row numbers in Table 4) of $A$ are significantly worse than all members of $B$: {1} $<$ {2} $<$ {3, …, 11} $<$ {12, 13} $<$ {14, 15, 16} $<$ {17, …, 23}. Test of equal proportions, $\alpha<0.05$. See Table 6 in the appendix for more details.)
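For readers who want to reproduce this kind of significance check on accuracy, a two-sided test of equal proportions can be sketched as below. The text does not say which variant was used, so the pooled two-proportion z-test without continuity correction shown here is an assumption, and the function name is hypothetical.

```python
import math


def equal_proportions_p_value(correct_a: int, n_a: int,
                              correct_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 1.0  # both systems are always right or always wrong
    z = (p_a - p_b) / se
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))
```

Here `correct_a` and `correct_b` are the numbers of correctly typed test entities of the two systems, and `n_a` and `n_b` are the numbers of evaluated entities.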

Character-level models are on lines 2-6. The order of systems is: CNN $>$ NSL $>$ BiLSTM $>$ LSTM $>$ FORWARD. The results show that complex neural networks are more effective than simple forwarding. BiLSTM works better than LSTM, confirming other related work. CNNs probably work better than LSTMs because there are few complex non-local dependencies in the sequence, but many important local features. CNNs with maxpooling can more straightforwardly capture local and position-independent features. CNN also beats NSL baseline; a possible reason is that CNN – an automatic method of feature learning – is more robust than hand engineered feature based NSL. We show more detailed results in Section 4.3.

Word-level models are on lines 7-10. BOW performs worse than WWLR because it cannot deal well with sparseness. SSKIP uses word order information in WWLR and performs better than SKIP. SWLR uses subword information and performs better than WWLR, especially for tail entities. Integrating subword information improves the quality of embeddings for rare words and mitigates the problem of unknown words.

Joint word-character level models are on lines 11-13. WWLR+CLR(CNN) and SWLR+CLR(CNN) beat the component models. This confirms our underlying assumption in designing the complementary multi-level models. BOW's problem with rare words does not allow its joint model with NSL to work better than NSL. WWLR+CLR(CNN) works better than BOW+CLR(NSL) by 10% micro $F_{1}$, again due to the limits of BOW compared to WWLR. Interestingly, WWLR+CLR works better than SWLR+CLR, which suggests that WWLR is indeed richer than SWLR when CLR mitigates its problem with rare/unknown words.

Entity-level models are on lines 14–15 and they are better than all previous models on lines 1–13. This shows the power of entity-level embeddings. In Figure 4, a t-SNE [Van der Maaten and Hinton, 2008] visualization of ELR(SKIP) embeddings using different colors for entity types shows that entities of the same type are clustered together. SSKIP works marginally better than SKIP for ELR, especially for tail entities, confirming our hypothesis that order information is important for a good distributional entity representation. This also confirms the results of ?), who likewise obtain better entity typing results with SSKIP than with SKIP. They propose to use entity typing as an extrinsic evaluation for embedding models.

Joint entity, word, and character level models are on lines 16-23. The AGG-FIGER baseline works better than the systems on lines 1-13, but worse than ELRs. This is probably because AGG-FIGER is optimized for mention typing and is trained under the distant supervision assumption. Parallel to our work, ?) optimize a mention typing model for our entity typing task by introducing multi instance learning algorithms, resulting in performance comparable to ELR(SKIP). We will investigate their method in future work.

Joining CLR with ELR (line 17) results in large improvements, especially for tail entities (5% micro $F_{1}$). This demonstrates that for rare entities, contextual information is often not sufficient for an informative representation, hence name features are important. This is also true for the joint models of WWLR/SWLR and ELR (lines 18-19). Joining WWLR works better than CLR, and SWLR is slightly better than WWLR. Joint models of WWLR/SWLR with ELR+CLR give further improvements, and SWLR is again slightly better than WWLR. ELR+WWLR+CLR and ELR+SWLR+CLR are better than their two-level counterparts, again confirming that these levels are complementary.

We get a further boost, especially for tail entities, by also including TC (type cosine) in the combinations (lines 22-23). This demonstrates the potential advantage of having a common representation space for entities and types. Our best model, ELR+SWLR+CLR+TC (line 22), which we refer to as MuLR in the other tables, beats our initial baselines (ELR and AGG-FIGER) by large margins, e.g., in tail entities improvements are more than 8% in micro F1.

Table 4 shows type macro $F_{1}$ for MuLR (ELR+SWLR+CLR+TC) and two baselines. There are 11 head types (those with $\geq$ 3000 train entities) and 36 tail types (those with $<$ 200 train entities). These results again confirm the superiority of our multi-level models over the baselines: AGG-FIGER and ELR, the best single-level model baseline.

4.3 Analysis

Unknown vs. known entities. To analyze the complementarity of character and word level representations, as well as more fine-grained comparison of our models and the baselines, we divide test entities into known entities – at least one word of the entity’s name appears in a train entity – and unknown entities (the complement). There are 45,000 (resp. 15,000) known (resp. unknown) test entities.
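The known/unknown split can be expressed in a few lines. The sketch assumes whitespace tokenization and case-sensitive matching of name words, neither of which is specified in the text. The function name is our own.

```python
from typing import List, Tuple


def split_known_unknown(test_names: List[str],
                        train_names: List[str]) -> Tuple[List[str], List[str]]:
    """Known: at least one word of the name occurs in some train entity name."""
    train_vocab = {w for name in train_names for w in name.split()}
    known: List[str] = []
    unknown: List[str] = []
    for name in test_names:
        if any(w in train_vocab for w in name.split()):
            known.append(name)
        else:
            unknown.append(name)
    return known, unknown


known, unknown = split_known_unknown(
    ["Lake Kasumigaura", "Zumpango"],
    ["Lake Tahoe", "Walter Scott"],
)
```

For unknown entities, a bag-of-words model trained only on train names has no active features, which is why this split is informative for comparing BOW with the character-level and embedding-based models.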

Table 4 shows that the CNN works only slightly better (by 0.3%) than NSL on known entities, but works much better on unknown entities (by 3.3%), justifying our preference for deep learning CLR models. As expected, BOW works relatively well for known entities and really poorly for unknown entities. SWLR beats CLR models as well as BOW. The reason is that in our setup, word embeddings are induced on the entire corpus using an unsupervised algorithm. Thus, even for many words that did not occur in train, SWLR has access to informative representations of words. The joint model, SWLR+CLR(CNN), is significantly better than BOW+CLR(NSL) again due to limits of BOW. SWLR+CLR(CNN) is better than SWLR in unknown entities.

Case study of living-thing. To understand the interplay of different levels better, we perform a case study of the type living-thing. Living beings that are not humans belong to this type.

WLRs incorrectly assign “Walter Leaf” (person) and “Along Came A Spider” (music) to living-thing because these names contain a word referring to a living-thing (“leaf”, “spider”), but the entity itself is not a living-thing. In these cases, the averaging of embeddings that WLR performs is misleading. The CLR(CNN) types these two entities correctly because their names contain character ngram/shape patterns that are indicative of person and music.

ELR incorrectly assigns “Zumpango” (city) and “Lake Kasumigaura” (location) to living-thing because these entities are rare and words associated with living things (e.g., “wildlife”) dominate in their contexts. However, CLR(CNN) and WLR enable the joint model to type the two entities correctly: “Zumpango” because of the informative suffix “-go” and “Lake Kasumigaura” because of the informative word “Lake”.

While some of the remaining errors of our best system MuLR are due to the inherent difficulty of entity typing (e.g., it is difficult to correctly type a one-word entity that occurs once and whose name is not informative), many other errors are due to artifacts of our setup. First, ClueWeb/FACC1 is the result of an automatic entity linking system and any entity linking errors propagate to our models. Second, due to the incompleteness of Freebase [Yaghoobzadeh and Schütze, 2015], many entities in the FIGMENT dataset are incompletely annotated, resulting in correctly typed entities being evaluated as incorrect.

Adding another source: description-based embeddings. While in this paper, we focus on the contexts and names of entities, there is a textual source of information about entities in KBs which we can also make use of: descriptions of entities. We extract Wikipedia descriptions of FIGMENT entities filtering out the entities ( $\sim$ 40,000 out of $\sim$ 200,000) without description.

We then build a simple entity representation by averaging the embeddings of the top $k$ words (wrt tf-idf) of the description (henceforth, AVG-DES); $k=20$ gives the best results on dev. This representation is used as input in Figure 1 to train the MLP. We also train our best multi-level model as well as the joint of the two on this smaller dataset. Since the descriptions come from Wikipedia, we use 300-dimensional GloVe [URL, 2016a] embeddings pretrained on Wikipedia+Gigaword to get more coverage of words. For MuLR, we still use the embeddings we trained before.
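A minimal sketch of an AVG-DES-style representation is shown below. The tf-idf variant (raw term frequency times $\log(N/\mathrm{df})$), the restriction to words that have an embedding, the alphabetical tie-breaking and the zero vector for empty descriptions are all assumptions, as are the function names.

```python
import math
from collections import Counter
from typing import Dict, List


def top_k_tfidf(words: List[str], doc_freq: Dict[str, int],
                n_docs: int, k: int = 20) -> List[str]:
    """Top-k distinct words of one description by tf-idf (ties broken alphabetically)."""
    tf = Counter(words)
    scores = {w: c * math.log(n_docs / max(1, doc_freq.get(w, 1)))
              for w, c in tf.items()}
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]


def avg_des(words: List[str], emb: Dict[str, List[float]],
            doc_freq: Dict[str, int], n_docs: int,
            k: int = 20, dim: int = 300) -> List[float]:
    """Average the embeddings of the top-k tf-idf words that have an embedding."""
    covered = [w for w in words if w in emb]
    chosen = top_k_tfidf(covered, doc_freq, n_docs, k)
    if not chosen:
        return [0.0] * dim  # assumption: zero vector if no word is covered
    size = len(emb[chosen[0]])
    return [sum(emb[w][i] for w in chosen) / len(chosen) for i in range(size)]
```

Frequent function words receive an idf close to zero under this weighting, so the average is dominated by the more specific keywords of the description.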

Results are shown in Table 5. While for head entities, MuLR works marginally better, the difference is very small in tail entities. The joint model of the two (by concatenation of vectors) improves the micro $F_{1}$, with a clear boost for tail entities. This suggests that for tail entities, the contextual and name information is not enough by itself and some keywords from descriptions can be really helpful. Integrating more complex description-based embeddings, e.g., by using CNN [Xie et al., 2016], may improve the results further. We leave it for future work.

5 Conclusion

In this paper, we have introduced representations of entities on different levels: character, word and entity. The character level representation is learned from the entity name. The word level representation is computed from the embeddings of the words $w_{i}$ in the entity name where the embedding of $w_{i}$ is derived from the corpus contexts of $w_{i}$ . The entity level representation of entity $e_{i}$ is derived from the corpus contexts of $e_{i}$ . Our experiments show that each of these levels contributes complementary information for the task of fine-grained typing of entities. The joint model of all three levels beats the state-of-the-art baseline by large margins. We further showed that extracting some keywords from Wikipedia descriptions of entities, when available, can considerably improve entity representations, especially for rare entities. We believe that our findings can be transferred to other tasks where entity representation matters.

Acknowledgments. This work was supported by DFG (SCHU 2246/8-2).

Appendix A Supplementary Material

Bibliography

[Ballesteros et al., 2015] Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with lstms. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 349–359, Lisbon, Portugal, September. Association for Computational Linguistics.
[Bojanowski et al., 2016] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. CoRR, abs/1607.04606.
[Bollacker et al., 2008] Kurt D. Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 1247–1250.
[Cao and Rei, 2016] Kris Cao and Marek Rei. 2016. A joint model for word embedding and word morphology. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 18–26, Berlin, Germany, August. Association for Computational Linguistics.
[Chen et al., 2015] Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huan-Bo Luan. 2015. Joint learning of character and word embeddings. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 1236–1242.
[Chiu and Nichols, 2016] Jason Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics, 4:357–370.
[Del Corro et al., 2015] Luciano Del Corro, Abdalghani Abujabal, Rainer Gemulla, and Gerhard Weikum. 2015. Finet: Context-aware fine-grained named entity typing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 868–878, Lisbon, Portugal, September. Association for Computational Linguistics.
[dos Santos and Guimarães, 2015] Cícero Nogueira dos Santos and Victor Guimarães. 2015. Boosting named entity recognition with neural character embeddings. CoRR, abs/1505.05008.