Question Answering on Knowledge Bases and Text using Universal Schema   and Memory Networks

Rajarshi Das; Manzil Zaheer; Siva Reddy; Andrew McCallum

arXiv:1704.08384·cs.CL·April 28, 2017

Question Answering on Knowledge Bases and Text using Universal Schema and Memory Networks

Rajarshi Das, Manzil Zaheer, Siva Reddy, Andrew McCallum

PDF

TL;DR

This paper introduces a universal schema approach combined with memory networks to improve question answering by integrating knowledge bases and unstructured text, achieving state-of-the-art results.

Contribution

It extends universal schema to natural language QA and employs memory networks for end-to-end training on combined KB and text data.

Findings

01

Outperforms KB or text alone in QA tasks

02

Achieves 8.5 F1 points improvement over previous state-of-the-art

03

Effective end-to-end training on question-answer pairs

Abstract

Existing question answering methods infer answers either from a knowledge base or from raw text. While knowledge base (KB) methods are good at answering compositional questions, their performance is often affected by the incompleteness of the KB. Au contraire, web text contains millions of facts that are absent in the KB, however in an unstructured form. {\it Universal schema} can support reasoning on the union of both structured KBs and unstructured text by aligning them in a common embedded space. In this paper we extend universal schema to natural language question answering, employing \emph{memory networks} to attend to the large body of facts in the combination of text and KB. Our models can be trained in an end-to-end fashion on question-answer pairs. Evaluation results on \spades fill-in-the-blank question answering dataset show that exploiting universal schema for question…

Figures2

Click any figure to enlarge with its caption.

Tables3

Table 1. Table 1: QA results on Spades .

Model	Dev. F₁	Test F₁
Bisk et al. (2016)	32.7	31.4
OnlyKB	39.1	38.5
OnlyText	25.3	26.6
Ensemble.	39.4	38.6
UniSchema	41.1	39.9

Table 2. Table 2: A few questions on which OnlyKB fails to answer but UniSchema succeeds.

Question	Answer
1. USA have elected _blank_, our first african-american president.	Obama
2. Angelina has reportedly been threatening to leave _blank_.	Brad_Pitt
3. Spanish is more often a second and weaker language among many _blank_.	Latinos
4. _blank_ is the third largest city in the United_States.	Chicago
5. _blank_ was Belshazzar ’s father.	Nabonidus

Table 3. Table 3: Detailed results on Spades .

Model	Dev. F₁
OnlyKB correct	39.1
OnlyText correct	25.3
UniSchema correct	41.1
OnlyKB or OnlyText got it correct	45.9
Both OnlyKB and OnlyText got it correct	18.5
OnlyKB got it correct and OnlyText did not	20.6
OnlyText got it correct and OnlyKB did not	6.80
Both UniSchema and OnlyKB got it correct	34.6
UniSchema got it correct and OnlyKB did not	6.42
OnlyKB got it correct and UniSchema did not	4.47
Both UniSchema and OnlyText got it correct	19.2
UniSchema got it correct and OnlyText did not	21.9
OnlyText got it correct and UniSchema did not	6.09

Equations2

\vspace - 5 pt c_{t} = W_{t} c_{t - 1} + W_{p} (k, v) \in M \sum (c_{t - 1} \cdot k) v

\vspace - 5 pt c_{t} = W_{t} c_{t - 1} + W_{p} (k, v) \in M \sum (c_{t - 1} \cdot k) v

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Question Answering on Knowledge Bases and Text

using Universal Schema and Memory Networks

Rajarshi Das*∗♠* Manzil Zaheer*∗♡* Siva Reddy♣

Andrew McCallum♠

♠College of Information and Computer Sciences, University of Massachusetts Amherst

♡School of Computer Science, Carnegie Mellon University

♣School of Informatics, University of Edinburgh

{rajarshi, mccallum}@cs.umass.edu, [email protected]

[email protected]

Abstract

Existing question answering methods infer answers either from a knowledge base or from raw text. While knowledge base (KB) methods are good at answering compositional questions, their performance is often affected by the incompleteness of the KB. Au contraire, web text contains millions of facts that are absent in the KB, however in an unstructured form. Universal schema can support reasoning on the union of both structured KBs and unstructured text by aligning them in a common embedded space. In this paper we extend universal schema to natural language question answering, employing memory networks to attend to the large body of facts in the combination of text and KB. Our models can be trained in an end-to-end fashion on question-answer pairs. Evaluation results on Spades fill-in-the-blank question answering dataset show that exploiting universal schema for question answering is better than using either a KB or text alone. This model also outperforms the current state-of-the-art by 8.5 $F_{1}$ points.111Code and data available in https://rajarshd.github.io/TextKBQA

1 Introduction

Question Answering (QA) has been a long-standing goal of natural language processing. Two main paradigms evolved in solving this problem: 1) answering questions on a knowledge base; and 2) answering questions using text.

Knowledge bases (KB) contains facts expressed in a fixed schema, facilitating compositional reasoning. These attracted research ever since the early days of computer science, e.g., BASEBALL Green Jr et al. (1961). This problem has matured into learning semantic parsers from parallel question and logical form pairs Zelle and Mooney (1996); Zettlemoyer and Collins (2005), to recent scaling of methods to work on very large KBs like Freebase using question and answer pairs Berant et al. (2013). However, a major drawback of this paradigm is that KBs are highly incomplete Dong et al. (2014). It is also an open question whether KB relational structure is expressive enough to represent world knowledge Stanovsky et al. (2014); Gardner and Krishnamurthy (2017)

The paradigm of exploiting text for questions started in the early 1990s Kupiec (1993). With the advent of web, access to text resources became abundant and cheap. Initiatives like TREC QA competitions helped popularizing this paradigm Voorhees et al. (1999). With the recent advances in deep learning and availability of large public datasets, there has been an explosion of research in a very short time Rajpurkar et al. (2016); Trischler et al. (2016); Nguyen et al. (2016); Wang and Jiang (2016); Lee et al. (2016); Xiong et al. (2016); Seo et al. (2016); Choi et al. (2016). Still, text representation is unstructured and does not allow the compositional reasoning which structured KB supports.

An important but under-explored QA paradigm is where KB and text are exploited together Ferrucci et al. (2010). Such combination is attractive because text contains millions of facts not present in KB, and a KB’s generative capacity represents infinite number of facts that are never seen in text. However QA inference on this combination is challenging due to the structural non-uniformity of KB and text. Distant supervision methods Bunescu and Mooney (2007); Mintz et al. (2009); Riedel et al. (2010); Yao et al. (2010); Zeng et al. (2015) address this problem partially by means of aligning text patterns with KB. But the rich and ambiguous nature of language allows a fact to be expressed in many different forms which these models fail to capture. Universal schema Riedel et al. (2013) avoids the alignment problem by jointly embedding KB facts and text into a uniform structured representation, allowing interleaved propagation of information. Figure 1 shows a universal schema matrix which has pairs of entities as rows, and Freebase and textual relations in columns. Although universal schema has been extensively used for relation extraction, this paper shows its applicability to QA. Consider the question USA has elected blank, our first african-american president with its answer Barack Obama. While Freebase has a predicate for representing presidents of USA, it does not have one for ‘african-american’ presidents. Whereas in text, we find many sentences describing the presidency of Barack Obama and his ethnicity at the same time. Exploiting both KB and text makes it relatively easy to answer this question than relying on only one of these sources.

Memory networks (MemNN; Weston et al. 2015) are a class of neural models which have an external memory component for encoding short and long term context. In this work, we define the memory components as observed cells of the universal schema matrix, and train an end-to-end QA model on question-answer pairs.

The contributions of the paper are as follows (a) We show that universal schema representation is a better knowledge source for QA than either KB or text alone, (b) On the SPADES dataset Bisk et al. (2016), containing real world fill-in-the-blank questions, we outperform state-of-the-art semantic parsing baseline, with 8.5 $F_{1}$ points. (c) Our analysis shows how individual data sources help fill the weakness of the other, thereby improving overall performance.

2 Background

Problem Definition

Given a question $q$ with words $w_{1},w_{2},\ldots,w_{n}$ , where these words contain one $\_blank\_$ and at least one entity, our goal is to fill in this $\_blank\_$ with an answer entity $q_{a}$ using a knowledge base $\mathcal{K}$ and text $\mathcal{T}$ . Few example question answer pairs are shown in Table 2.

Universal Schema

Traditionally universal schema is used for relation extraction in the context of knowledge base population. Rows in the schema are formed by entity pairs (e.g. USA, NYC), and columns represent the relation between them. A relation can either be a KB relation, or it could be a pattern of text that exist between these two entities in a large corpus. The embeddings of entities and relation types are learned by low-rank matrix factorization techniques. Riedel et al. (2013) treat textual patterns as static symbols, whereas recent work by Verga et al. (2016) replaces them with distributed representation of sentences obtained by a RNN. Using distributed representation allows reasoning on sentences that are similar in meaning but different on the surface form. We too use this variant to encode our textual relations.

Memory Networks

MemNNs are neural attention models with external and differentiable memory. MemNNs decouple the memory component from the network thereby allowing it store external information. Previously, these have been successfully applied to question answering on KB where the memory is filled with distributed representation of KB triples Bordes et al. (2015), or for reading comprehension Sukhbaatar et al. (2015); Hill et al. (2016), where the memory consists of distributed representation of sentences in the comprehension. Recently, key-value MemNN are introduced Miller et al. (2016) where each memory slot consists of a key and value. The attention weight is computed only by comparing the question with the key memory, whereas the value is used to compute the contextual representation to predict the answer. We use this variant of MemNN for our model. Miller et al. (2016), in their experiments, store either KB triples or sentences as memories but they do not explicitly model multiple memories containing distinct data sources like we do.

3 Model

Our model is a MemNN with universal schema as its memory. Figure 1 shows the model architecture.

Memory:

Our memory $\mathcal{M}$ comprise of both KB and textual triples from universal schema. Each memory cell is in the form of key-value pair. Let $(\mathrm{s,r,o})\in\mathcal{K}$ represent a KB triple. We represent this fact with distributed key $\mathbf{k}\in\mathbb{R}^{2d}$ formed by concatenating the embeddings $\mathbf{s}\in\mathbb{R}^{d}$ and $\mathbf{r}\in\mathbb{R}^{d}$ of subject entity $s$ and relation $r$ respectively. The embedding $\mathbf{o}\in\mathbb{R}^{d}$ of object entity $o$ is treated as its value $\mathbf{v}$ .

Let $(\mathrm{s,~{}[w_{1},\ldots,arg_{1},\ldots,arg_{2},w_{n}],~{}o})\in\mathcal{T}$ represent a textual fact, where $\mathrm{arg_{1}}$ and $\mathrm{arg_{2}}$ correspond to the positions of the entities ‘ $\mathrm{s}$ ’ and ‘ $\mathrm{o}$ ’. We represent the key as the sequence formed by replacing $\mathrm{arg_{1}}$ with ‘ $s$ ’ and $arg_{2}$ with a special ‘ $\_\mathrm{blank}\_$ ’ token, i.e., $k$ = $[\mathrm{w_{1}},~{}\ldots,\mathrm{s},~{}\ldots,~{}\_\mathrm{blank}\_,~{}\mathrm{w_{n}}]$ and value as just the entity ‘ $\mathrm{o}$ ’. We convert $k$ to a distributed representation using a bidirectional LSTM Hochreiter and Schmidhuber (1997); Graves and Schmidhuber (2005), where $\mathbf{k}\in\mathbb{R}^{2d}$ is formed by concatenating the last states of forward and backward LSTM, i.e., $\mathbf{k}~{}=~{}\left[\overrightarrow{\mathrm{LSTM}}(\mathrm{k});\overleftarrow{\mathrm{LSTM}}(\mathrm{k})\right]$ . The value $\mathbf{v}$ is the embedding of the object entity $o$ . Projecting both KB and textual facts to $\mathbb{R}^{2d}$ offers a unified view of the knowledge to reason upon. In Figure 1, each cell in the matrix represents a memory containing the distributed representation of its key and value.

Question Encoder:

A bidirectional LSTM is also used to encode the input question $q$ to a distributed representation $\mathbf{q}\in\mathbb{R}^{2d}$ similar to the key encoding step above.

Attention over cells:

We compute attention weight of a memory cell by taking the dot product of its key $\mathbf{k}$ with a contextual vector $\mathbf{c}$ which encodes most important context in the current iteration. In the first iteration, the contextual vector is the question itself. We only consider the memory cells that contain at least one entity in the question. For example, for the input question in Figure 1, we only consider memory cells containing USA. Using the attention weights and values of memory cells, we compute the context vector $\mathbf{c_{t}}$ for the next iteration $t$ as follows:

[TABLE]

where $\mathbf{c_{0}}$ is initialized with question embedding $\mathbf{q}$ , $\mathbf{W_{p}}$ is a projection matrix, and $\mathbf{W_{t}}$ represents the weight matrix which considers the context in previous hop and the values in the current iteration based on their importance (attention weight). This multi-iterative context selection allows multi-hop reasoning without explicitly requiring a symbolic query representation.

Answer Entity Selection:

The final contextual vector $c_{t}$ is used to select the answer entity $q_{a}$ (among all 1.8M entities in the dataset) which has the highest inner product with it.

4 Experiments

4.1 Evaluation Dataset

We use Freebase Bollacker et al. (2008) as our KB, and ClueWeb Gabrilovich et al. (2013) as our text source to build universal schema. For evaluation, literature offers two options: 1) datasets for text-based question answering tasks such as answer sentence selection and reading comprehension; and 2) datasets for KB question answering.

Although the text-based question answering datasets are large in size, e.g., SQuAD Rajpurkar et al. (2016) has over 100k questions, answers to these are often not entities but rather sentences which are not the focus of our work. Moreover these texts may not contain Freebase entities at all, making these skewed heavily towards text. Coming to the alternative option, WebQuestions Berant et al. (2013) is widely used for QA on Freebase. This dataset is curated such that all questions can be answered on Freebase alone. But since our goal is to explore the impact of universal schema, testing on a dataset completely answerable on a KB is not ideal. WikiMovies dataset Miller et al. (2016) also has similar properties. Gardner and Krishnamurthy (2017) created a dataset with motivations similar to ours, however this is not publicly released during the submission time.

Instead, we use Spades Bisk et al. (2016) as our evaluation data which contains fill-in-the-blank cloze-styled questions created from ClueWeb. This dataset is ideal to test our hypothesis for following reasons: 1) it is large with 93K sentences and 1.8M entities; and 2) since these are collected from Web, most sentences are natural. A limitation of this dataset is that it contains only the sentences that have entities connected by at least one relation in Freebase, making it skewed towards Freebase as we will see (§ 4.4). We use the standard train, dev and test splits for our experiments. For text part of universal schema, we use the sentences present in the training set.

4.2 Models

We evaluate the following models to measure the impact of different knowledge sources for QA.

OnlyKB:

In this model, MemNN memory contains only the facts from KB. For each KB triple $(e_{1},r,e_{2})$ , we have two memory slots, one for $(e_{1},r,e_{2})$ and the other for its inverse $(e_{2},r^{i},e_{1})$ .

OnlyTEXT:

Spades contains sentences with blanks. We replace the blank tokens with the answer entities to create textual facts from the training set. Using every pair of entities, we create a memory cell similar to as in universal schema.

Ensemble

This is an ensemble of the above two models. We use a linear model that combines the scores from, and use an ensemble to combine the evidences from individual models.

UniSchema

This is our main model with universal schema as its memory, i.e., it contains memory slots corresponding to both KB and textual facts.

4.3 Implementation Details

The dimensions of word, entity and relation embeddings, and LSTM states were set to $d=$ 50. The word and entity embeddings were initialized with word2vec Mikolov et al. (2013) trained on 7.5 million ClueWeb sentences containing entities in Freebase subset of Spades. The network weights were initialized using Xavier initialization Glorot and Bengio (2010). We considered up to a maximum of 5k KB facts and 2.5k textual facts for a question. We used Adam Kingma and Ba (2015) with the default hyperparameters (learning rate=1e-3, $\beta_{1}$ =0.9, $\beta_{2}$ =0.999, $\epsilon$ =1e-8) for optimization. To overcome exploding gradients, we restricted the magnitude of the $\ell_{2}$ norm of the gradient to 5. The batch size during training was set to 32.

To train the UniSchema model, we initialized the parameters from a trained OnlyKB model. We found that this is crucial in making the UniSchema to work. Another caveat is the need to employ a trick similar to batch normalization Ioffe and Szegedy (2015). For each minibatch, we normalize the mean and variance of the textual facts and then scale and shift to match the mean and variance of the KB memory facts. Empirically, this stabilized the training and gave a boost in the final performance.

4.4 Results and Discussions

Table 1 shows the main results on Spades. UniSchema outperforms all our models validating our hypothesis that exploiting universal schema for QA is better than using either KB or text alone. Despite Spades creation process being friendly to Freebase, exploiting text still provides a significant improvement. Table 2 shows some of the questions which UniSchema answered but OnlyKB failed. These can be broadly classified into (a) relations that are not expressed in Freebase (e.g., african-american presidents in sentence 1); (b) intentional facts since curated databases only represent concrete facts rather than intentions (e.g., threating to leave in sentence 2); (c) comparative predicates like first, second, largest, smallest (e.g., sentences 3 and 4); and (d) providing additional type constraints (e.g., in sentence 5, Freebase does not have a special relation for father. It can be expressed using the relation parent along with the type constraint that the answer is of gender male).

We have also anlalyzed the nature of UniSchema attention. In 58.7% of the cases the attention tends to prefer KB facts over text. This is as expected since KBs facts are concrete and accurate than text. In 34.8% of cases, the memory prefers to attend text even if the fact is already present in the KB. For the rest (6.5%), the memory distributes attention weight evenly, indicating for some questions, part of the evidence comes from text and part of it from KB. Table 3 gives a more detailed quantitative analysis of the three models in comparison with each other.

To see how reliable is UniSchema, we gradually increased the coverage of KB by allowing only a fixed number of randomly chosen KB facts for each entity. As Figure 2 shows, when the KB coverage is less than 16 facts per entity, UniSchema outperforms OnlyKB by a wide-margin indicating UniSchema is robust even in resource-scarce scenario, whereas OnlyKB is very sensitive to the coverage. UniSchema also outperforms Ensemble showing joint modeling is superior to ensemble on the individual models. We also achieve the state-of-the-art with 8.5 $F_{1}$ points difference. Bisk et al. use graph matching techniques to convert natural language to Freebase queries whereas even without an explicit query representation, we outperform them.

5 Related Work

A majority of the QA literature that focused on exploiting KB and text either improves the inference on the KB using text based features Krishnamurthy and Mitchell (2012); Reddy et al. (2014); Joshi et al. (2014); Yao and Van Durme (2014); Yih et al. (2015); Neelakantan et al. (2015b); Guu et al. (2015); Xu et al. (2016b); Choi et al. (2015); Savenkov and Agichtein (2016) or improves the inference on text using KB Sun et al. (2015).

Limited work exists on exploiting text and KB jointly for question answering. Gardner and Krishnamurthy (2017) is the closest to ours who generate a open-vocabulary logical form and rank candidate answers by how likely they occur with this logical form both in Freebase and text. Our models are trained on a weaker supervision signal without requiring the annotation of the logical forms.

A few QA methods infer on curated databases combined with OpenIE triples Fader et al. (2014); Yahya et al. (2016); Xu et al. (2016a). Our work differs from them in two ways: 1) we do not need an explicit database query to retrieve the answers Neelakantan et al. (2015a); Andreas et al. (2016); and 2) our text-based facts retain complete sentential context unlike the OpenIE triples Banko et al. (2007); Carlson et al. (2010).

6 Conclusions

In this work, we showed universal schema is a promising knowledge source for QA than using KB or text alone. Our results conclude though KB is preferred over text when the KB contains the fact of interest, a large portion of queries still attend to text indicating the amalgam of both text and KB is superior than KB alone.

Acknowledgments

We sincerely thank Luke Vilnis for helpful insights. This work was supported in part by the Center for Intelligent Information Retrieval and in part by DARPA under agreement number FA8750-13-2-0020. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

Bibliography57

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Andreas et al. (2016) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to Compose Neural Networks for Question Answering. In NAACL .
2Banko et al. (2007) Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open Information Extraction from the Web. In IJCAI .
3Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP .
4Bisk et al. (2016) Yonatan Bisk, Siva Reddy, John Blitzer, Julia Hockenmaier, and Mark Steedman. 2016. Evaluating Induced CCG Parsers on Grounded Semantic Parsing. In EMNLP .
5Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In ICDM .
6Bordes et al. (2015) Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. Co RR .
7Bunescu and Mooney (2007) Razvan C. Bunescu and Raymond J. Mooney. 2007. Learning to extract relations from the web using minimal supervision. In ACL .
8Carlson et al. (2010) Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Jr. Estevam R. Hruschka, and Tom M. Mitchell. 2010. Toward an Architecture for Never-ending Language Learning. In AAAI .