Question Answering on Knowledge Bases and Text using Universal Schema and Memory Networks
Rajarshi Das, Manzil Zaheer, Siva Reddy, Andrew McCallum

TL;DR
This paper introduces a universal schema approach combined with memory networks to improve question answering by integrating knowledge bases and unstructured text, achieving state-of-the-art results.
Contribution
It extends universal schema to natural language QA and employs memory networks for end-to-end training on combined KB and text data.
Findings
Outperforms KB or text alone in QA tasks
Achieves 8.5 F1 points improvement over previous state-of-the-art
Effective end-to-end training on question-answer pairs
Abstract
Existing question answering methods infer answers either from a knowledge base or from raw text. While knowledge base (KB) methods are good at answering compositional questions, their performance is often affected by the incompleteness of the KB. Au contraire, web text contains millions of facts that are absent in the KB, however in an unstructured form. {\it Universal schema} can support reasoning on the union of both structured KBs and unstructured text by aligning them in a common embedded space. In this paper we extend universal schema to natural language question answering, employing \emph{memory networks} to attend to the large body of facts in the combination of text and KB. Our models can be trained in an end-to-end fashion on question-answer pairs. Evaluation results on \spades fill-in-the-blank question answering dataset show that exploiting universal schema for questionā¦
Click any figure to enlarge with its caption.
Figure 1
Figure 2| Model | Dev. F1 | Test F1 |
|---|---|---|
| Bisk etĀ al. (2016) | 32.7 | 31.4 |
| OnlyKB | 39.1 | 38.5 |
| OnlyText | 25.3 | 26.6 |
| Ensemble. | 39.4 | 38.6 |
| UniSchema | 41.1 | 39.9 |
| Question | Answer |
|---|---|
| 1. USA have elected _blank_, our first african-american president. | Obama |
| 2. Angelina has reportedly been threatening to leave _blank_. | Brad_Pitt |
| 3. Spanish is more often a second and weaker language among many _blank_. | Latinos |
| 4. _blank_ is the third largest city in the United_States. | Chicago |
| 5. _blank_ was Belshazzar ās father. | Nabonidus |
| Model | Dev. F1 |
|---|---|
| OnlyKB correct | 39.1 |
| OnlyText correct | 25.3 |
| UniSchema correct | 41.1 |
| OnlyKB or OnlyText got it correct | 45.9 |
| Both OnlyKB and OnlyText got it correct | 18.5 |
| OnlyKB got it correct and OnlyText did not | 20.6 |
| OnlyText got it correct and OnlyKB did not | 6.80 |
| Both UniSchema and OnlyKB got it correct | 34.6 |
| UniSchema got it correct and OnlyKB did not | 6.42 |
| OnlyKB got it correct and UniSchema did not | 4.47 |
| Both UniSchema and OnlyText got it correct | 19.2 |
| UniSchema got it correct and OnlyText did not | 21.9 |
| OnlyText got it correct and UniSchema did not | 6.09 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Question Answering on Knowledge Bases and Text
using Universal Schema and Memory Networks
Rajarshi Das*āā *āManzil Zaheer*āā”*āSiva Reddyā£
āā
Andrew McCallumā
ā College of Information and Computer Sciences, University of Massachusetts Amherst
ā”School of Computer Science, Carnegie Mellon University
ā£School of Informatics, University of Edinburgh
{rajarshi, mccallum}@cs.umass.edu, [email protected]
Abstract
Existing question answering methods infer answers either from a knowledge base or from raw text. While knowledge base (KB) methods are good at answering compositional questions, their performance is often affected by the incompleteness of the KB. Au contraire, web text contains millions of facts that are absent in the KB, however in an unstructured form. Universal schema can support reasoning on the union of both structured KBs and unstructured text by aligning them in a common embedded space. In this paper we extend universal schema to natural language question answering, employing memory networks to attend to the large body of facts in the combination of text and KB. Our models can be trained in an end-to-end fashion on question-answer pairs. Evaluation results on Spades fill-in-the-blank question answering dataset show that exploiting universal schema for question answering is better than using either a KB or text alone. This model also outperforms the current state-of-the-art by 8.5 points.111Code and data available in https://rajarshd.github.io/TextKBQA
1 Introduction
Question Answering (QA) has been a long-standing goal of natural language processing. Two main paradigms evolved in solving this problem: 1)Ā answering questions on a knowledge base; and 2)Ā answering questions using text.
Knowledge bases (KB) contains facts expressed in a fixed schema, facilitating compositional reasoning. These attracted research ever since the early days of computer science, e.g., BASEBALL GreenĀ Jr etĀ al. (1961). This problem has matured into learning semantic parsers from parallel question and logical form pairs Zelle and Mooney (1996); Zettlemoyer and Collins (2005), to recent scaling of methods to work on very large KBs like Freebase using question and answer pairs Berant etĀ al. (2013). However, a major drawback of this paradigm is that KBs are highly incomplete Dong etĀ al. (2014). It is also an open question whether KB relational structure is expressive enough to represent world knowledge Stanovsky etĀ al. (2014); Gardner and Krishnamurthy (2017)
The paradigm of exploiting text for questions started in the early 1990s Kupiec (1993). With the advent of web, access to text resources became abundant and cheap. Initiatives like TREC QA competitions helped popularizing this paradigm Voorhees etĀ al. (1999). With the recent advances in deep learning and availability of large public datasets, there has been an explosion of research in a very short time Rajpurkar etĀ al. (2016); Trischler etĀ al. (2016); Nguyen etĀ al. (2016); Wang and Jiang (2016); Lee etĀ al. (2016); Xiong etĀ al. (2016); Seo etĀ al. (2016); Choi etĀ al. (2016). Still, text representation is unstructured and does not allow the compositional reasoning which structured KB supports.
An important but under-explored QA paradigm is where KB and text are exploited together Ferrucci etĀ al. (2010). Such combination is attractive because text contains millions of facts not present in KB, and a KBās generative capacity represents infinite number of facts that are never seen in text. However QA inference on this combination is challenging due to the structural non-uniformity of KB and text. Distant supervision methods Bunescu and Mooney (2007); Mintz etĀ al. (2009); Riedel etĀ al. (2010); Yao etĀ al. (2010); Zeng etĀ al. (2015) address this problem partially by means of aligning text patterns with KB. But the rich and ambiguous nature of language allows a fact to be expressed in many different forms which these models fail to capture. Universal schema Riedel etĀ al. (2013) avoids the alignment problem by jointly embedding KB facts and text into a uniform structured representation, allowing interleaved propagation of information. FigureĀ 1 shows a universal schema matrix which has pairs of entities as rows, and Freebase and textual relations in columns. Although universal schema has been extensively used for relation extraction, this paper shows its applicability to QA. Consider the question USA has elected blank, our first african-american president with its answer Barack Obama. While Freebase has a predicate for representing presidents of USA, it does not have one for āafrican-americanā presidents. Whereas in text, we find many sentences describing the presidency of Barack Obama and his ethnicity at the same time. Exploiting both KB and text makes it relatively easy to answer this question than relying on only one of these sources.
Memory networks (MemNN; Weston etĀ al. 2015) are a class of neural models which have an external memory component for encoding short and long term context. In this work, we define the memory components as observed cells of the universal schema matrix, and train an end-to-end QA model on question-answer pairs.
The contributions of the paper are as follows (a)Ā We show that universal schema representation is a better knowledge source for QA than either KB or text alone, (b) On the SPADES dataset Bisk etĀ al. (2016), containing real world fill-in-the-blank questions, we outperform state-of-the-art semantic parsing baseline, with 8.5Ā points. (c) Our analysis shows how individual data sources help fill the weakness of the other, thereby improving overall performance.
2 Background
Problem Definition
Given a question with words , where these words contain one and at least one entity, our goal is to fill in this with an answer entity using a knowledge base and text . Few example question answer pairs are shown in TableĀ 2.
Universal Schema
Traditionally universal schema is used for relation extraction in the context of knowledge base population. Rows in the schema are formed by entity pairs (e.g. USA, NYC), and columns represent the relation between them. A relation can either be a KB relation, or it could be a pattern of text that exist between these two entities in a large corpus. The embeddings of entities and relation types are learned by low-rank matrix factorization techniques. Riedel etĀ al. (2013) treat textual patterns as static symbols, whereas recent work by Verga etĀ al. (2016) replaces them with distributed representation of sentences obtained by a RNN. Using distributed representation allows reasoning on sentences that are similar in meaning but different on the surface form. We too use this variant to encode our textual relations.
Memory Networks
MemNNs are neural attention models with external and differentiable memory. MemNNs decouple the memory component from the network thereby allowing it store external information. Previously, these have been successfully applied to question answering on KB where the memory is filled with distributed representation of KB triples Bordes etĀ al. (2015), or for reading comprehension Sukhbaatar etĀ al. (2015); Hill etĀ al. (2016), where the memory consists of distributed representation of sentences in the comprehension. Recently, key-value MemNN are introduced Miller etĀ al. (2016) where each memory slot consists of a key and value. The attention weight is computed only by comparing the question with the key memory, whereas the value is used to compute the contextual representation to predict the answer. We use this variant of MemNN for our model. Miller etĀ al. (2016), in their experiments, store either KB triples or sentences as memories but they do not explicitly model multiple memories containing distinct data sources like we do.
3 Model
Our model is a MemNN with universal schema as its memory. FigureĀ 1 shows the model architecture.
Memory:
Our memory comprise of both KB and textual triples from universal schema. Each memory cell is in the form of key-value pair. Let represent a KB triple. We represent this fact with distributed key formed by concatenating the embeddings and of subject entity and relation respectively. The embedding of object entity is treated as its value .
Let represent a textual fact, where and correspond to the positions of the entities āā and āā. We represent the key as the sequence formed by replacing with āā and with a special āā token, i.e., = and value as just the entity āā. We convert to a distributed representation using a bidirectional LSTM Hochreiter and Schmidhuber (1997); Graves and Schmidhuber (2005), where is formed by concatenating the last states of forward and backward LSTM, i.e., . The value is the embedding of the object entity . Projecting both KB and textual facts to offers a unified view of the knowledge to reason upon. In FigureĀ 1, each cell in the matrix represents a memory containing the distributed representation of its key and value.
Question Encoder:
A bidirectional LSTM is also used to encode the input question to a distributed representation similar to the key encoding step above.
Attention over cells:
We compute attention weight of a memory cell by taking the dot product of its key with a contextual vector which encodes most important context in the current iteration. In the first iteration, the contextual vector is the question itself. We only consider the memory cells that contain at least one entity in the question. For example, for the input question in FigureĀ 1, we only consider memory cells containingĀ USA. Using the attention weights and values of memory cells, we compute the context vector for the next iteration as follows:
[TABLE]
where is initialized with question embedding , is a projection matrix, and represents the weight matrix which considers the context in previous hop and the values in the current iteration based on their importance (attention weight). This multi-iterative context selection allows multi-hop reasoning without explicitly requiring a symbolic query representation.
Answer Entity Selection:
The final contextual vector is used to select the answer entity (among all 1.8M entities in the dataset) which has the highest inner product with it.
4 Experiments
4.1 Evaluation Dataset
We use Freebase Bollacker etĀ al. (2008) as our KB, and ClueWeb Gabrilovich etĀ al. (2013) as our text source to build universal schema. For evaluation, literature offers two options: 1) datasets for text-based question answering tasks such as answer sentence selection and reading comprehension; and 2) datasets for KB question answering.
Although the text-based question answering datasets are large in size, e.g., SQuAD Rajpurkar etĀ al. (2016) has over 100k questions, answers to these are often not entities but rather sentences which are not the focus of our work. Moreover these texts may not contain Freebase entities at all, making these skewed heavily towards text. Coming to the alternative option, WebQuestions Berant etĀ al. (2013) is widely used for QA on Freebase. This dataset is curated such that all questions can be answered on Freebase alone. But since our goal is to explore the impact of universal schema, testing on a dataset completely answerable on a KB is not ideal. WikiMovies dataset Miller etĀ al. (2016) also has similar properties. Gardner and Krishnamurthy (2017) created a dataset with motivations similar to ours, however this is not publicly released during the submission time.
Instead, we use Spades Bisk et al. (2016) as our evaluation data which contains fill-in-the-blank cloze-styled questions created from ClueWeb. This dataset is ideal to test our hypothesis for following reasons: 1) it is large with 93K sentences and 1.8M entities; and 2) since these are collected from Web, most sentences are natural. A limitation of this dataset is that it contains only the sentences that have entities connected by at least one relation in Freebase, making it skewed towards Freebase as we will see (§ 4.4). We use the standard train, dev and test splits for our experiments. For text part of universal schema, we use the sentences present in the training set.
4.2 Models
We evaluate the following models to measure the impact of different knowledge sources for QA.
OnlyKB:
In this model, MemNN memory contains only the facts from KB. For each KB triple , we have two memory slots, one for and the other for its inverse .
OnlyTEXT:
Spades contains sentences with blanks. We replace the blank tokens with the answer entities to create textual facts from the training set. Using every pair of entities, we create a memory cell similar to as in universal schema.
Ensemble
This is an ensemble of the above two models. We use a linear model that combines the scores from, and use an ensemble to combine the evidences from individual models.
UniSchema
This is our main model with universal schema as its memory, i.e., it contains memory slots corresponding to both KB and textual facts.
4.3 Implementation Details
The dimensions of word, entity and relation embeddings, and LSTM states were set to 50. The word and entity embeddings were initialized with word2vec Mikolov etĀ al. (2013) trained on 7.5 million ClueWeb sentences containing entities in Freebase subset of Spades. The network weights were initialized using Xavier initialization Glorot and Bengio (2010). We considered up to a maximum of 5k KB facts and 2.5k textual facts for a question. We used Adam Kingma and Ba (2015) with the default hyperparameters (learning rate=1e-3, =0.9, =0.999, =1e-8) for optimization. To overcome exploding gradients, we restricted the magnitude of the norm of the gradient to 5. The batch size during training was set to 32.
To train the UniSchema model, we initialized the parameters from a trained OnlyKB model. We found that this is crucial in making the UniSchema to work. Another caveat is the need to employ a trick similar to batch normalization Ioffe and Szegedy (2015). For each minibatch, we normalize the mean and variance of the textual facts and then scale and shift to match the mean and variance of the KB memory facts. Empirically, this stabilized the training and gave a boost in the final performance.
4.4 Results and Discussions
TableĀ 1 shows the main results on Spades. UniSchema outperforms all our models validating our hypothesis that exploiting universal schema for QA is better than using either KB or text alone. Despite Spades creation process being friendly to Freebase, exploiting text still provides a significant improvement. TableĀ 2 shows some of the questions which UniSchema answered but OnlyKB failed. These can be broadly classified into (a) relations that are not expressed in Freebase (e.g., african-american presidents in sentenceĀ 1); (b) intentional facts since curated databases only represent concrete facts rather than intentions (e.g., threating to leave in sentenceĀ 2); (c) comparative predicates like first, second, largest, smallest (e.g.,Ā sentencesĀ 3 andĀ 4); and (d) providing additional type constraints (e.g., in sentenceĀ 5, Freebase does not have a special relation for father. It can be expressed using the relation parent along with the type constraint that the answer is of gender male).
We have also anlalyzed the nature of UniSchema attention. In 58.7% of the cases the attention tends to prefer KB facts over text. This is as expected since KBs facts are concrete and accurate than text. In 34.8% of cases, the memory prefers to attend text even if the fact is already present in the KB. For the rest (6.5%), the memory distributes attention weight evenly, indicating for some questions, part of the evidence comes from text and part of it from KB. TableĀ 3 gives a more detailed quantitative analysis of the three models in comparison with each other.
To see how reliable is UniSchema, we gradually increased the coverage of KB by allowing only a fixed number of randomly chosen KB facts for each entity. As FigureĀ 2 shows, when the KB coverage is less thanĀ 16 facts per entity, UniSchema outperforms OnlyKB by a wide-margin indicating UniSchema is robust even in resource-scarce scenario, whereas OnlyKB is very sensitive to the coverage. UniSchema also outperforms Ensemble showing joint modeling is superior to ensemble on the individual models. We also achieve the state-of-the-art withĀ 8.5 points difference. Bisk etĀ al. use graph matching techniques to convert natural language to Freebase queries whereas even without an explicit query representation, we outperform them.
5 Related Work
A majority of the QA literature that focused on exploiting KB and text either improves the inference on the KB using text based features Krishnamurthy and Mitchell (2012); Reddy etĀ al. (2014); Joshi etĀ al. (2014); Yao and VanĀ Durme (2014); Yih etĀ al. (2015); Neelakantan etĀ al. (2015b); Guu etĀ al. (2015); Xu etĀ al. (2016b); Choi etĀ al. (2015); Savenkov and Agichtein (2016) or improves the inference on text using KB Sun etĀ al. (2015).
Limited work exists on exploiting text and KB jointly for question answering. Gardner and Krishnamurthy (2017) is the closest to ours who generate a open-vocabulary logical form and rank candidate answers by how likely they occur with this logical form both in Freebase and text. Our models are trained on a weaker supervision signal without requiring the annotation of the logical forms.
A few QA methods infer on curated databases combined with OpenIE triples Fader etĀ al. (2014); Yahya etĀ al. (2016); Xu etĀ al. (2016a). Our work differs from them in two ways: 1) we do not need an explicit database query to retrieve the answers Neelakantan etĀ al. (2015a); Andreas etĀ al. (2016); and 2) our text-based facts retain complete sentential context unlike the OpenIE triples Banko etĀ al. (2007); Carlson etĀ al. (2010).
6 Conclusions
In this work, we showed universal schema is a promising knowledge source for QA than using KB or text alone. Our results conclude though KB is preferred over text when the KB contains the fact of interest, a large portion of queries still attend to text indicating the amalgam of both text and KB is superior than KB alone.
Acknowledgments
We sincerely thank Luke Vilnis for helpful insights. This work was supported in part by the Center for Intelligent Information Retrieval and in part by DARPA under agreement number FA8750-13-2-0020. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Andreas et al. (2016) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to Compose Neural Networks for Question Answering. In NAACL .
- 2Banko et al. (2007) Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open Information Extraction from the Web. In IJCAI .
- 3Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP .
- 4Bisk et al. (2016) Yonatan Bisk, Siva Reddy, John Blitzer, Julia Hockenmaier, and Mark Steedman. 2016. Evaluating Induced CCG Parsers on Grounded Semantic Parsing. In EMNLP .
- 5Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In ICDM .
- 6Bordes et al. (2015) Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. Co RR .
- 7Bunescu and Mooney (2007) Razvan C. Bunescu and Raymond J. Mooney. 2007. Learning to extract relations from the web using minimal supervision. In ACL .
- 8Carlson et al. (2010) Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Jr. Estevam R. Hruschka, and Tom M. Mitchell. 2010. Toward an Architecture for Never-ending Language Learning. In AAAI .
