Neural Architecture for Question Answering Using a Knowledge Graph and Web Corpus
Uma Sawant, Saurabh Garg, Soumen Chakrabarti, Ganesh Ramakrishnan

TL;DR
AQQUCN is a flexible question answering system that effectively combines knowledge graph and web corpus evidence to improve entity retrieval across diverse query types, outperforming recent systems.
Contribution
It introduces a novel approach that integrates KG and corpus signals using convolutional networks, handling ambiguous and varied queries without relying on precise parsing.
Findings
5-16% improvement in mean average precision (MAP)
Almost doubled F1 scores for short queries
Effective handling of query ambiguity and syntax variation
| Description | Notation |
|---|---|
| Query represented as a string | $q$ |
| Set of grounded entities in $q$ | |
| Specific grounded entities | $e_1$ |
| Candidate answer entity; set of candidates | $e_2$ |
| Set of gold (ground truth) answer entities | |
| Target type (given a query interpretation) in KG | $t_2$ |
| Relation types in KG | $r$ |
| Query | Snippets from Entity-annotated Web corpus |
|---|---|
| spanish poet died civil war | [Positive] “Lorca was executed in 1936, during the spanish civil war.” |
| | [Negative] “The murder of the spanish poet by nationalists in the civil war remains one of Spain’s open wounds.” |
| Who was the first U.S. president ever to resign? | [Positive] “Nixon become the first president in American history to resign.” |
| | [Negative] “Gerald R. Ford took the oath of office after the first-ever resignation by a U.S. President.” |
| Type | Type patterns |
|---|---|
| /book/author | dramatist, author, journalist, poet, novelist, writer, editor |
| /people/deceased_person | dead, deceased, late, expired, deceased person, victim, person |
| /film/writer | screenwriter, writer |
| Relation | Relation patterns |
|---|---|
| /government/government_office_or_title/jurisdiction | of, president, president of, office |
| /film/writer/film | film, film by, by, of, written by, wrote, author of |
| /theater/play/composer | by, written by, music by, wrote, with music of, in |
| No. | Description |
|---|---|
| 1 | Sum of QCN match scores over all snippets for ($q$, $e_2$) |
| 2–8 | Entity match features 1–7 from Bast and Haußmann (2015) |
| 9 | Sum of QRN match scores over all relations $r$ connecting a grounded entity $e_1$ to $e_2$ |
| 10–19 | Relation token match features 8–17 from Bast and Haußmann (2015) |
| 20 | Best QTN match score over all feasible types $t_2$ of $e_2$ |
| 21–26 | General features 18–23 from Bast and Haußmann (2015) |
| 27 | AQQU-assigned rank of the structured interpretation supporting $e_2$ |
| Source | Name | #train | #test | Query type |
|---|---|---|---|---|
| TREC and INEX query tracks | TREC-INEX-KW | 493 | 211 | Syntax-poor |
| TREC and INEX query tracks | TREC-INEX | 493 | 211 | Syntax-rich |
| WebQuestions | WebQuestions-KW | 563 | 240 | Syntax-poor |
| WebQuestions | WebQuestions | 3778 | 2032 | Syntax-rich |
| Data set | System | MAP | MRR | NDCG |
|---|---|---|---|---|
| TREC-INEX-KW | Joshi et al (2014) | 0.409 | 0.419 | 0.502 |
| TREC-INEX-KW | AQQUCN-ALL | 0.536 | 0.561 | 0.587 |
| TREC-INEX | Joshi et al (2014) | 0.358 | 0.362 | 0.426 |
| TREC-INEX | AQQUCN-ALL | 0.409 | 0.420 | 0.445 |
| WebQuestions-KW | Joshi et al (2014) | 0.377 | 0.401 | 0.474 |
| WebQuestions-KW | AQQUCN-ALL | 0.525 | 0.543 | 0.575 |
| WebQuestions | AQQUCN-ALL | 0.604 | 0.615 | 0.632 |
| Data set | AQQUCN-1 | AQQUCN-FEW | AQQUCN-ALL |
|---|---|---|---|
| TREC-INEX-KW | 0.264 | 0.388 | 0.417 |
| TREC-INEX | 0.269 | 0.285 | 0.323 |
| WebQuestions-KW | 0.492 | 0.437 | 0.392 |
| WebQuestions | 0.532 | 0.512 | 0.497 |
| Data | System | F1 |
|---|---|---|
| TREC-INEX-KW | Berant and Liang (2015) | 0.127 |
| TREC-INEX-KW | AQQU (Bast and Haußmann, 2015) | 0.222 |
| TREC-INEX-KW | AQQUCN-Best (AQQUCN-ALL) | 0.417 |
| TREC-INEX-KW | AQQUCN-ALL (ideal threshold) | 0.578 |
| TREC-INEX-KW | KG+Corpus best single interpretation | 0.362 |
| TREC-INEX | Berant and Liang (2015) | 0.107 |
| TREC-INEX | AQQU (Bast and Haußmann, 2015) | 0.258 |
| TREC-INEX | AQQUCN-Best (AQQUCN-ALL) | 0.323 |
| TREC-INEX | AQQUCN-ALL (ideal threshold) | 0.435 |
| TREC-INEX | KG+Corpus best single interpretation | 0.395 |
| WebQuestions-KW | Berant and Liang (2015) | 0.365 |
| WebQuestions-KW | AQQU (Bast and Haußmann, 2015) | 0.470 |
| WebQuestions-KW | AQQUCN-Best (AQQUCN-1) | 0.492 |
| WebQuestions-KW | AQQUCN-ALL (ideal threshold) | 0.570 |
| WebQuestions-KW | KG+Corpus best single interpretation | 0.698 |
| WebQuestions | Yao and Van Durme (2014) | 0.330 |
| WebQuestions | Berant et al (2013) | 0.357 |
| WebQuestions | Yao (2015) | 0.443 |
| WebQuestions | Berant and Liang (2015) | 0.496 |
| WebQuestions | AQQU (Bast and Haußmann, 2015) | 0.521 |
| WebQuestions | STAGG (Yih et al, 2015) | 0.525 |
| WebQuestions | Text2KB (Savenkov and Agichtein, 2016) | 0.525 |
| WebQuestions | Text2KB+STAGG | 0.532 |
| WebQuestions | AQQUCN-Best (AQQUCN-1) | 0.532 |
| WebQuestions | Xu et al (2016) | 0.533 |
| WebQuestions | Text2KB+STAGG (ideal threshold) | 0.606 |
| WebQuestions | AQQUCN-ALL (ideal threshold) | 0.634 |
| WebQuestions | KG+Corpus best single interpretation | 0.737 |
| Data | System | F1 |
|---|---|---|
| TREC-INEX-KW | AQQUCN-Best | 0.417 |
| TREC-INEX-KW | No QCN | 0.167 |
| TREC-INEX-KW | No QTN | 0.410 |
| TREC-INEX-KW | No QRN | 0.412 |
| TREC-INEX | AQQUCN-Best | 0.323 |
| TREC-INEX | No QCN | 0.192 |
| TREC-INEX | No QTN | 0.317 |
| TREC-INEX | No QRN | 0.320 |
| WebQuestions-KW | AQQUCN-Best | 0.492 |
| WebQuestions-KW | No QCN | 0.480 |
| WebQuestions-KW | No QTN | 0.475 |
| WebQuestions-KW | No QRN | 0.489 |
| WebQuestions | AQQUCN-Best | 0.532 |
| WebQuestions | No QCN | 0.526 |
| WebQuestions | No QTN | 0.527 |
| WebQuestions | No QRN | 0.529 |
Abstract
In Web search, entity-seeking queries often trigger a special Question Answering (QA) system. It may use a parser to translate the question into a structured query, execute that on a knowledge graph (KG), and return direct entity responses. QA systems based on precise parsing tend to be brittle: minor syntax variations may dramatically change the response. Moreover, KG coverage is patchy. At the other extreme, a large corpus may provide broader coverage, but in an unstructured, unreliable form. We present AQQUCN, a QA system that gracefully combines KG and corpus evidence. AQQUCN accepts a broad spectrum of query syntax, from well-formed questions to short “telegraphic” keyword sequences. In the face of inherent query ambiguities, AQQUCN aggregates signals from KGs and large corpora to directly rank KG entities, rather than commit to one semantic interpretation of the query. AQQUCN models the ideal interpretation as an unobservable or latent variable. Interpretations and candidate entity responses are scored as pairs, by combining signals from multiple convolutional networks that operate collectively on the query, KG and corpus. On four public query workloads, amounting to over 8,000 queries with diverse query syntax, we see 5–16% absolute improvement in mean average precision (MAP), compared to the entity ranking performance of recent systems. Our system is also competitive at entity set retrieval, almost doubling F1 scores for challenging short queries.
1 Introduction
A large fraction of Web queries involve and seek entities (Lin et al, 2012). Such queries may seek details of celebrities or movies (e.g., kingsman release date), historical events (e.g., Who killed Gandhi?), travel (e.g., nearest airport to baikal lake), and so on. Queries that match certain patterns are handed off to specialized QA systems that directly return entity responses from a KG. Sometimes, a semantic parse of the textual query is attempted (Berant et al, 2013; Yih et al, 2015) to translate it to a structured query over the KG, which is then executed to fetch a set of response entities. (These are known as KBQA or “Knowledge Base Question Answering” systems.) While providing precise answers if everything goes well, this approach to KG-driven QA is fraught with several difficulties.
- The input textual query may range from grammatically well-formed questions (e.g., In which band did Jimmy Page perform before Led Zeppelin?) to free-form “telegraphic” keyword queries (e.g., band jimmy page was in before led zeppelin). QA systems are often brittle with regard to input syntax, backing off if the input does not match specific syntactic patterns.
- A curated, structured collection of facts in a KG reduces the QA task to “compiling” the textual query into a structured form which is directly executed on the KG. But KG coverage is always patchy, with nodes and/or edges missing, particularly when less popular entities are concerned. For example, over 70% of people in Freebase do not have a place of birth in Freebase (West et al, 2014). If the types and relations expressed textually in the query cannot be mapped confidently to the KG, most QA systems back off.
- Alternatively, one can extend IR-style text search by using an entity-annotated corpus of Web pages. Any text snippet in the corpus that mentions a candidate entity $e_2$ and also matches the question $q$ well can be considered supporting evidence for $e_2$ being the answer to $q$. However, such evidence from the Web corpus can be noisy due to incorrect entity linking of mentions in the query or the snippet, and imperfect text matching between the query and the snippet.
Example query and response:
Figure 1 demonstrates the advantages and complexities of effective entity-level QA involving both KG and corpus. Tokens in the query have diverse, possibly overlapping roles. Specifically, a query span may hint at an entity, type or relation, or it can be used to match passages in the corpus. Understanding these roles and disambiguating each hint to the respective semantic node in the KG (wherever applicable) helps interpret the query. For example, the query band jimmy page was in before led zeppelin refers to the entities $e_1$ = Jimmy_Page and Led_Zeppelin. We call such entities, grounded in the query, the grounded entities, with members $e_1$, etc. Mentions in both the query and corpus documents are linked to entity nodes in the KG (e.g. jimmy page). The target type of the query is denoted $t_2$ and a candidate answer entity is denoted $e_2$. The answer candidates together form the candidate set, and the gold (ground truth) entities form the gold set. Band hints at the type $t_2$ = musical_group of the expected answer entity $e_2$ = The_Yardbirds. The (rather weak) hint was in points to the relation $r$ = /music/musical_group/member connecting $e_1$ and $e_2$. Thus, identifying $e_1$, $t_2$ and $r$ can lead us to many candidate $e_2$s, The_Yardbirds being one of them. Yet, the query interpretation is not complete because an important token, ‘before’, is not considered. If the KG does not have timestamps on membership, or the QA engine cannot do arithmetic with timestamps, passages in the corpus can still offer supplementary evidence by matching before with prior to, along with mentions of $e_1$ and $e_2$. Thus, using both KG and corpus allows combining structured and unstructured evidence to answer a query.
This example also serves to highlight our challenges. The various curved, colored lines in Figure 1 map the query hints to the KG or the corpus evidence, either created during a pre-processing stage (e.g. entity linking in the corpus) or at run-time (e.g. matching the target type $t_2$ = musical_group to the query text Band through a type model). The machine-learnt models which map the hints to KG entities, types or relations need to handle a great deal of ambiguity, as a hint may match many correct and incorrect KG artifacts. Thus, there may be multiple KG subgraphs and corpus text snippets, each appearing to support different correct or incorrect entity candidates. There is thus a clear need for robust and seamless aggregation of supporting evidence across corpus and KG.
Our contributions:
We present a new QA system, AQQUCN (so named because it augments the AQQU system of Bast and Haußmann (2015) with convolutional networks), with these salient features:
- AQQUCN is resilient to a spectrum of query styles, from syntactically well-formed questions to short ‘telegraphic’ Web queries. It does not attempt a grammar-based parse of the query.
- AQQUCN uses KG and corpus signals in conjunction to score responses. Rather than a single comparison network between query and corpus (Severyn and Moschitti, 2015; Bahdanau et al, 2014), AQQUCN uses a heterogeneous network architecture tailored to structural properties of queries.
- Instead of choosing one structured interpretation and executing it on the KG to get a response set, AQQUCN is capable of directly ranking entities based on evidence pooled over multiple structured interpretations.
We review related work in Section 2. In Section 3 we give an overview of AQQUCN, and in Section 4 we describe all the modules in detail. In Section 5 we evaluate AQQUCN against recent competitive baseline systems. Our code with relevant data will be made available (CSAW, 2018).
2 Related work
Recent QA systems are the result of convergence between several communities: Information Retrieval (IR), NLP, machine learning, and neural networks.
2.1 Corpus-oriented entity search
Early work in the IR community focused on corpus-driven QA in the TREC-QA track (Wang, 2006; Cardie, 2012). The Web and IR community has traditionally assumed a free-form query that is often ‘telegraphic’. Web search queries being far more noisy, the goal of structure discovery is more modest. Indeed, in expert search, one of the earliest forms of corpus-based entity search, which focused on finding experts (people) in a given field, query structure discovery was given no importance. State-of-the-art expert search systems (Balog et al, 2009; Macdonald and Ounis, 2011; Petkova and Croft, 2007) collect text snippets (or documents) containing query words, match each evidence snippet to an expert (e.g. using the signal that the expert’s name is mentioned in the snippet), and aggregate such evidence snippets to rank the experts. Generative language models (Balog et al, 2006), proximity-based kernels (Petkova and Croft, 2007) and feature-based supervised discriminative learning (Fang et al, 2010) were evaluated to score the evidence match. Conversely, document retrieval can be improved by expanding the query with entity features (Dalton et al, 2014).
2.2 Entity search from knowledge graphs (KGQA/KBQA)
As the information extraction and NLP communities developed more tools for annotating corpus spans with named entity (NE) types (Ling and Weld, 2012) and canonical entity IDs from KGs (Ganea and Hofmann, 2017), corpus-based techniques were refined to match answer types (Murdock et al, 2012). With support from large KGs like Wikipedia and Freebase, the NLP community developed semantic parsers (Berant et al, 2013; Yao and Van Durme, 2014; Yih et al, 2015; Kasneci et al, 2008; Pound et al, 2012; Yahya et al, 2012; Kwiatkowski et al, 2013) that translated natural language queries to a target graph query language similar to SPARQL. These approaches typically assume that question utterances are grammatically well-formed, from which precise clause structure, ground constants, variables, and connective relations can be inferred via semantic parsing. A similar system called AQQU (Bast and Haußmann, 2015) emerged from the IR community. We base our system on it, so we will describe it separately in Section 3.1. Such approaches are often correlated with the assumption that all usable knowledge has been curated into the KG. The query is first translated to a structured form, which is then executed on the KG.
2.3 Combining corpus and KG
There is increasing interest in combining corpus and KG for QA. AQQUCN is related to the single-relation QA system described by Joshi et al (2014). Unlike AQQUCN, they attempted an explicit 4-way segmentation of the query to identify the mention spans that mark grounded entities $e_1$, the mention span that marks a hint to the relation $r$, the span that marks a hint to the target type $t_2$, and the remaining tokens, designated as selectors, that are meant to keyword-match corpus snippets. (The hint to a relation might be distributed among multiple disjoint spans, but this is not a serious problem for our proposed system because we allow spans with multiple roles.) Together these spans form the overall query segmentation. The segmentation guides their system to propose structured KG artifacts ($e_1$, $r$, $t_2$) corresponding to the query spans. The candidate response entity $e_2$ is then scored over admissible values of all latent variables (see Figure 2). KG and corpus signals are unified as various factors in the model. We address two limitations in this system. First, we do not attempt a hard query segmentation. Second, we replace the traditional discrete language models that inform the factor potentials with continuous neural counterparts.
More work in this vein followed rapidly. Xu et al (2016) presented a KGQA system with a corpus-based postprocessing pruning stage that removes candidates with weak corpus support. A more symmetric architecture called Text2KB was proposed by Savenkov and Agichtein (2016). A KGQA system as in Section 2.2 collects candidate answers. A corpus search collects snippets from top-ranking documents and annotates (Globerson et al, 2016; Ganea and Hofmann, 2017) them with KG entities. For each candidate in the union, features are collected from both KG and corpus snippets to rank them. In a similar spirit, Xiong et al (2017) propose AttR-Duet. Both the query and corpus passage are represented as bags of words as well as entities. Definition and mention texts in the KG and corpus are used to bridge between the space of entities and words, and four families of similarities are defined. These are then combined using a neural network. Reminiscent of how Joshi et al (2014) incorporated a corpus-based factor/potential into a graphical model (Figure 2), Bast and Buchhold (2017) proposed QLever, which extended (Chakrabarti, 2010) SPARQL with predicates over an entity-annotated corpus. The primary focus of QLever is on high performance in the face of query clauses spanning KG and corpus indices, not ranking accuracy per se.
2.4 Complex QA using neural techniques
Early improvements to QA systems resulted from replacing discrete word matching and scoring with word vector counterparts. In corpus-based QA, Bordes et al (2014) modeled queries and passages as bags of words and simply added up their word embeddings to represent and compare them. Yang et al (2014) used embeddings to translate questions into relational predicates. More refined sentence/query embeddings have been created (Severyn and Moschitti, 2015; Iyyer et al, 2014) via recurrent networks (RNNs) and convolutional networks (CNNs), but usually applied to syntax-rich, well-formed questions. Dong et al (2015) obtained better accuracy than Bordes et al (2014) by replacing the aggregated word vector query representation with multiple parallel CNNs for extracting deep representations for relation between query entity (i.e. the entity mentioned in the query) and answer entity, answer type and the KG neighborhood of the query entity. While the above works dealt with entity retrieval, CNNs have also been actively explored in query-document matching for complex answer retrieval (Hui et al, 2017, 2018; MacAvaney et al, 2018). EviNets (Savenkov and Agichtein, 2017) embeds the query and evidence passage as average word vectors. Then it collects various aggregates (Macdonald and Ounis, 2011) of vector match scores, which are combined using a trained pooling network. None of these neural systems seek a structural understanding of the query and how its parts relate differentially to the KG and corpus. Recently, neural learning techniques are being used to translate very complex queries (Saha et al, 2018) like how many countries have more rivers than Brazil into multi-layer expression graphs or multi-step imperative programs (Andreas et al, 2016; Dong and Lapata, 2016; Miller et al, 2016; Zhong et al, 2017; Reed and De Freitas, 2015; Liang et al, 2016). 
The extreme complexity of the reinforcement learning formulations needed has, thus far, precluded scaling them to Web-scale corpora and noisy query and corpus annotation tools.
3 AQQUCN overview
AQQUCN, a system we have built by extending AQQU (Bast and Haußmann, 2015), implements all the inference pathways shown in Figure 1. Unlike AQQU and some other systems, the end goal of AQQUCN is not to “compile” the input into a structured query to execute on the KG, because broken/missing input syntax can make this attempt fail in brittle ways. Instead, the end goal of AQQUCN is to directly create a ranking over entities using KG and corpus, using interpretations as latent variables. We first review AQQU briefly, and then describe the three stages of AQQUCN in the rest of this section.
3.1 AQQU review
AQQU interprets the input question with reference to three possible “query templates”, each having a direct translation to a SPARQL query. First, grounded entities $e_1$ in the query are identified. Then, a guided expansion in the KG locates candidates $e_2$, resulting in structured interpretations that may take forms such as these one- and two-hop queries:
- One hop: a grounded entity $e_1$ is connected to a candidate $e_2$ by a single relation $r$.
- Two hops: $e_1$ is connected to $e_2$ through an intermediate node, where the intermediate node is a mediator ‘entity’ often representing a ternary relation.
AQQU then extracts features from the question $q$, the interpretation (which could have a few distinct structures as above), and all the candidate $e_2$s together. These features are used in logistic regression or random forests to score each interpretation. The best interpretation (with unbound placeholders for the answer entity and the mediator, if applicable) is then executed on the KG. During training, the gold interpretation is not known, but executing each interpretation gives a system response set that can be compared against the gold set to get an F1 score. This helps AQQU train logistic regression or random forests to score better interpretations higher than worse ones. The unit of scoring and ranking is a single interpretation, not an entity, nor a joint space of interpretation and entity, as is the case in AQQUCN. Moreover, AQQU does not use corpus signals.
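The training signal described above, where each interpretation is executed and its response set is compared with the gold set, can be illustrated with a small sketch. This is not AQQU's code: the function name `set_f1`, the interpretation labels, and the entity sets are all hypothetical, and only the set-level F1 computation is shown.

```python
def set_f1(predicted, gold):
    """F1 between a predicted entity set and the gold entity set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical interpretations for one training query, each mapped to the
# entity set it would return when executed on the KG (made-up data).
responses = {
    "interp_a": {"The_Yardbirds"},
    "interp_b": {"The_Yardbirds", "The_Firm"},
    "interp_c": {"Robert_Plant"},
}
gold = {"The_Yardbirds"}

# Interpretations with higher F1 should be scored higher by the learner.
f1_by_interp = {name: set_f1(resp, gold) for name, resp in responses.items()}
ranked = sorted(f1_by_interp, key=f1_by_interp.get, reverse=True)
```

Under these made-up response sets, the interpretation that returns exactly the gold set is the preferred training target, and the one with no overlap is the least preferred.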
3.2 Modifications and new modules
Our implementation of AQQUCN is based on AQQU because it provides a well-written, reusable implementation of query entity linking and interpretation generation. We modify and enhance AQQU in the following ways, as also illustrated in Figure 4 (for the query template). The important steps are listed below. (Item numbers correspond to modules in Figure 4.)
1. Given a query $q$, we first identify the set of in-query entities using the widely used entity tagger TagMe (Ferragina and Scaiella, 2010). (As in all QA systems, entity-linking accuracy does affect QA accuracy, but the variation is hard to characterize without a battery of entity linking methods with carefully controlled recall/precision profiles. AQQU gave slightly better accuracy with TagMe than with its own linker, so we used TagMe for all experiments. SMAPH (Cornolti et al, 2014) would be a better choice, but it is provided only as a network service, and it needs Google search as yet another level of network service, which has severe usage volume restrictions.)
2. The KG neighborhood of each $e_1$ is explored to collect candidate $e_2$s, similar to prior KBQA (Knowledge Base driven QA) systems (Bast and Haußmann, 2015; Yih et al, 2015; Xu et al, 2016). Like AQQU, we limit the KG candidates to the set of entities occurring in the 2-hop KG neighborhood of any grounded entity. (Two hops are needed to traverse mediator nodes.)
3. As the example in Figure 1 illustrated, some evidence supporting the correct entity, such as before in the query and prior to in the snippet, may come from the corpus. As we shall see in Section 5, corpus evidence can greatly augment KG-based evidence. Therefore, we also gather text snippets (at roughly the granularity of sentences) that mention some $e_1$ or words from the query.
4. The next step is to identify the set of candidate answer entities. Apart from KG neighborhoods of $e_1$s, we also collect (non-$e_1$) entities that occur in any snippet as candidate $e_2$s. This allows the system to recover from early errors, such as when the set of grounded entities identified by the entity tagger is empty or wrong, or when the query and answer entities are more than two hops apart in the KG. The candidates from the KG and the corpus, taken together, are called $e_2$s.
5. We use three neural modules to inform the score combination network, which replaces AQQU’s interpretation ranking module. The query corpus convnet (QCN), described in Section 4.1, scores the evidence in the context snippets, given the query and the candidate $e_2$.
6. Candidates $e_2$ collected from the KG may be accompanied by designated types $t_2$.
7. The query type convnet (QTN), described in Section 4.2, scores the potential presence of a textual clue to $t_2$ somewhere in the query.
8. A candidate $e_2$ found in the KG is also connected to $e_1$ via a relation $r$.
9. The query relation convnet (QRN), described in Section 4.3, scores the potential presence of a textual clue to $r$ somewhere in the query.
10. The score combination network works on the joint space of candidate interpretations and entities, ending with a ranking of candidate entities that may draw signals from multiple interpretations in general. AQQU scores are used as additional features (not shown). Outputs from the three convnets are wired together in a markedly non-uniform architecture, consistent with the inference pathways shown in Figure 1.
A summary of notation introduced thus far is given in Table 1.
Unlike Joshi et al (2014), AQQUCN does not attempt to segment the query into disjoint spans that describe $e_1$, $t_2$ and $r$, but lets multiple neural networks run over the query. This allows AQQUCN to process long queries that Joshi et al (2014) could not. Moreover, a query span can inform multiple networks; consider the queries who discovered penicillin and who discovered antarctica, where ‘who’ carries a lot less information about $t_2$ than the mentions ‘penicillin’ and ‘antarctica’.
4 Detailed design of AQQUCN modules
In this section we present the details of the new modules we added to AQQU: the query-corpus, query-type and query-relation networks, as well as the score combination network.
4.1 Query-Corpus Network (QCN)
For each query $q$, we get zero or more evidence snippets from the entity-annotated Web corpus. Each snippet contains a candidate answer entity $e_2$. We use the query corpus network (QCN) for assigning a relevance score to each snippet. This relevance score, and also the confidence score for linking $e_2$ to a snippet, are features used in the final score combination network of AQQUCN (Section 4.4).
Given a query $q$ and a snippet from the Web corpus, QCN should assign a high score to the pair if the snippet contains evidence to correctly answer $q$. Training data for this network is in the form of positive and negative snippets for each training query. As manual generation of such labels involves considerable effort, we resort to (possibly noisy) indirect supervision instead. We treat all text snippets centered around a gold answer entity and containing at least one query word (non-stopword) as positively labeled snippets. Similarly, we treat all text snippets centered around any non-answer entity and containing some or all query words as negatively labeled snippets (examples in Table 2). To train this network, we use the state-of-the-art short text ranking system proposed by Severyn and Moschitti (2015). Once QCN is trained, we have a score for each snippet belonging to a candidate answer entity $e_2$.
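The indirect supervision rule for QCN training snippets can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the stopword list, the whitespace tokenizer, the function names, and the entity identifiers (including which entity each snippet is centered on) are hypothetical.

```python
# Illustrative stopword list; the list actually used is not given in the text.
STOPWORDS = {"the", "a", "an", "of", "in", "was", "to", "by", "during", "is"}


def content_words(text):
    """Lowercased non-stopword tokens, with surrounding punctuation stripped."""
    tokens = {w.strip(".,?!\"'").lower() for w in text.split()}
    return tokens - STOPWORDS - {""}


def label_snippet(query, snippet_text, centre_entity, gold_entities):
    """Return 1 (positive), 0 (negative), or None (snippet not used).

    A snippet is used only if it shares at least one non-stopword with the
    query; its label depends on whether its centre entity is a gold answer.
    """
    if not (content_words(query) & content_words(snippet_text)):
        return None
    return 1 if centre_entity in gold_entities else 0


query = "spanish poet died civil war"
gold = {"Federico_Garcia_Lorca"}  # hypothetical KG identifier

pos = label_snippet(
    query, "Lorca was executed in 1936, during the spanish civil war.",
    "Federico_Garcia_Lorca", gold)
neg = label_snippet(
    query, "The murder of the spanish poet by nationalists in the civil war "
    "remains one of Spain's open wounds.", "Spain", gold)
unused = label_snippet(query, "Madrid is the capital.", "Madrid", gold)
```

Because the labels come from entity identity and word overlap alone, a snippet such as the second one can be labeled negative even though it plausibly describes the answer, which is the noise the text refers to.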
While Siamese convolutional networks served well in the QCN module, the broad architecture of AQQUCN can accommodate competitive alternatives (Lv and Zhai, 2009; Petkova and Croft, 2007; Zhiltsov et al, 2015; Hui et al, 2017). Measuring the effect of this choice on QA accuracy is left for future work.
Match signals from multiple snippets supporting an entity have to be aggregated before being passed on to the combination network shown at the bottom of Figure 4. Each candidate $e_2$ may have a different number of supporting snippets. Usually, a number of standard aggregates (sum, max, etc.) are computed and then a weighted combination is learnt (Balog et al, 2009; Macdonald and Ounis, 2006; Sawant and Chakrabarti, 2013; Joshi et al, 2014). For our data sets, we found a simple sum of snippet scores to be adequate: we add up the snippet scores over all snippets belonging to an entity $e_2$ for a query $q$, and use it as a feature for ($q$, $e_2$) (feature 1 in Table 5). More complex pooled aggregators can be explored in future work.
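The sum aggregation is simple enough to show directly. The sketch below assumes the snippet scores for one query arrive as (entity, score) pairs; the function name, entity names, and scores are made up for illustration.

```python
from collections import defaultdict


def aggregate_snippet_scores(scored_snippets):
    """Sum per-snippet match scores for each candidate entity of one query."""
    totals = defaultdict(float)
    for entity, score in scored_snippets:
        totals[entity] += score
    return dict(totals)


# Hypothetical QCN outputs for a single query: (candidate entity, score).
scored = [("The_Yardbirds", 0.9), ("The_Yardbirds", 0.7), ("The_Firm", 0.4)]
feature_1 = aggregate_snippet_scores(scored)
```

A sum rewards candidates that are supported by many snippets, whereas a max would reward a single strong snippet; the text reports that the sum was adequate on its data sets.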
4.2 Query-Type Network (QTN)
The query-type network (QTN) outputs a compatibility score between the query $q$ and a candidate type $t_2$. This is a multi-class, multi-label classification problem, as a query may imply more than one correct answer type (e.g. /music/composer and /music/artist for the query saturday night fever music band). Good quality training data in the form of (query, type) pairs is essential to ensure that the network learns to handle different types of (or the lack of) query syntax and its correspondence with the answer type. In our first attempt, we included all ($q$, $t_2$) pairs in the training data, where $t_2$ was any type connected to the gold answer entity for the query $q$ in the training set. However, this strategy resulted in many spurious types. For example, /broadcast/radio_station_owner is not the correct answer type for the query maya moore college, even if the answer entity University_of_Connecticut belongs to that type. Therefore, we used human supervision. Paid student volunteers were asked to label ($q$, $t_2$) pairs as correct or incorrect, which helped remove approximately 30% of the pairs as irrelevant and improve training data quality.
We found that, given the small number of training queries, the data obtained through the above process may not be enough to capture the variety of syntax used to imply a type. For robust training, we resorted to representing the type through additional patterns (Table 3) obtained as follows.
Freebase relation names:
Consider a (subject, relation, object) fact triple, e.g. (Captain_America:_The_First_Avenger, /film/film/prequel, Thor). Relations in Freebase have composite names of the form “/x/y/z”, where x is a topical domain and y and z indicate the types for subject and object. E.g. prequel is a type indicator word for Thor. Meanwhile, Freebase declares the expected type for the endpoint entities of each relation $r$. For /film/film/prequel, the expected end type is /film/film. We combine these two information nuggets and add prequel as a pattern expressing the type /film/film.
Freebase type names:
Ending substrings of the type name are also considered as patterns (e.g. ‘treatment’ for the type medical_treatment).
At the end of this exercise, we have zero or more patterns for each type (Table 3).
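The two pattern sources just described can be sketched in a few lines. This is a minimal illustration, assuming relation names and their declared end types are available as a plain dictionary; the function names and the tokenization on "/" and "_" are assumptions, not the authors' code.

```python
from collections import defaultdict


def relation_name_patterns(relation_end_type):
    """Use the last segment of each relation name as a pattern for the
    expected type declared for that relation's end point."""
    patterns = defaultdict(set)
    for relation, end_type in relation_end_type.items():
        indicator = relation.rsplit("/", 1)[-1].replace("_", " ")
        patterns[end_type].add(indicator)
    return patterns


def type_name_patterns(type_name):
    """Ending word sequences of the final segment of a type name."""
    words = type_name.rsplit("/", 1)[-1].split("_")
    return {" ".join(words[i:]) for i in range(len(words))}


# Example from the text: /film/film/prequel has expected end type /film/film.
from_relations = relation_name_patterns({"/film/film/prequel": "/film/film"})

# Type-name endings for a type that appears in Table 3.
from_type_name = type_name_patterns("/people/deceased_person")
```

For /people/deceased_person this yields "deceased person" and "person", both of which appear among the patterns listed for that type in Table 3.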
Figure 5 illustrates the design of QTN. This multi-class multi-label architecture is partly inspired by Kim (2014) and Severyn and Moschitti (2015). For each query input to the network, the network provides an output score vector of size equal to the number of types, as follows:
1. In the initial layer, each query word is represented as a vector embedding learnt during training. Then convolution and pooling layers are used to extract a fixed-length feature vector from the variable-length input.
2. In a separate layer, we compute word overlap features inspired by Severyn and Moschitti (2015). Specifically, we compute Jaccard similarity between the query and each type name, by representing each as a bag of words. We also compute Jaccard similarity between the query and each type pattern, and then take the max over all patterns of a given type. This process results in two features for each (query, type) pair.
3. Similar to the bag-of-words overlap, we compute Jaccard similarity between the word stem (lemmatized) forms of each query, type name and type pattern, resulting in two more features for each (query, type) pair.
4. Once the overlap features as well as the convolution-pooling features are computed, a fully connected hidden layer with sigmoid activation function is used at the last stage, to score all types.
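The overlap features in steps 2 and 3 can be illustrated as follows. This is a sketch under stated assumptions: whitespace tokenization, and a toy suffix-stripping function standing in for a real stemmer or lemmatizer; `overlap_features` and `toy_stem` are hypothetical names.

```python
def jaccard(a, b):
    """Jaccard similarity between two bags of words, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlap_features(query, type_name, patterns, normalize=str.lower):
    """Return (similarity to the type name, max similarity over patterns).
    `normalize` maps each token to its surface or stemmed form."""
    def tokens(text):
        return [normalize(w) for w in text.split()]

    q = tokens(query)
    name_sim = jaccard(q, tokens(type_name))
    pattern_sim = max((jaccard(q, tokens(p)) for p in patterns), default=0.0)
    return name_sim, pattern_sim


def toy_stem(word):
    """Stand-in for a real stemmer (assumption): lowercase, drop trailing 's'."""
    return word.lower().rstrip("s")


query = "saturday night fever music band"
surface = overlap_features(query, "music artist", ["band", "music group"])
stemmed = overlap_features(query, "music artist", ["band", "music group"], toy_stem)
print(surface + stemmed)  # four overlap features for one (query, type) pair
```

Calling the function once on surface forms and once on stemmed forms yields the four overlap features per (query, type) pair described above.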
4.3 Query-Relation Network (QRN)
The Query-Relation Network (QRN) outputs a compatibility score between a candidate relation and the query. As with QTN, we generate multi-class, multi-label training data in the form of (query, relation) pairs, where the relation connects the gold answer entity to the entity mentioned in the query. There could be multiple such relations for the same pair of query entity and gold answer entity. Such a training data generation process is common in previous work (Dong et al, 2015; Yih et al, 2015; Bordes et al, 2014), and human curation is not used to remove noise.
Similar to QTN, we enrich our training data using relation description patterns. To generate these patterns, we start with (subject, relation, object) facts in the KG and locate sentences in the annotated corpus where both subject and object are mentioned. We identify the path connecting the two in the dependency parse of the sentence, expressed as a sequence of lemmatized words. We count the number of times each path was found, and retain only the most frequent paths. This gives a bag of path patterns that describe each relation (examples in Table 4). The network architecture for QRN is the same as in Figure 5. The only difference is that for QRN we have (query, relation) and (corpus pattern, relation) tuples as training instances.
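The counting step can be sketched as below. The sketch assumes the dependency paths have already been extracted and lemmatized by a parser (not shown); the function name, the `top_k` cutoff and the example paths are hypothetical.

```python
from collections import Counter, defaultdict


def top_path_patterns(observations, top_k=2):
    """observations: iterable of (relation, lemmatized_path) pairs, one per
    corpus sentence in which a KG fact's subject and object co-occur.
    Returns the top_k most frequent paths for each relation."""
    counts = defaultdict(Counter)
    for relation, path in observations:
        counts[relation][tuple(path)] += 1
    return {
        relation: [" ".join(path) for path, _ in counter.most_common(top_k)]
        for relation, counter in counts.items()
    }


observations = [
    ("/people/person/place_of_birth", ["bear", "in"]),
    ("/people/person/place_of_birth", ["bear", "in"]),
    ("/people/person/place_of_birth", ["native", "of"]),
]
print(top_path_patterns(observations, top_k=1))
```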
4.4 Score combination network
Referring back to Figure 4, QRN, QTN, and multiple QCNs send their scores as features to a final combination network that represents each candidate entity as a feature vector and scores it in conjunction with a candidate query interpretation. (For simplicity, we describe the single-relation case; multi-hop cases with mediator nodes are handled analogously.) In abstract terms, given a set of interpretations and a set of candidate entities, at this point we have a score matrix indexed by interpretations as rows and candidate response entities as columns.
During both training and inference, the interpretation is latent. Standard learning to rank methods (Liu, 2009) are not directly applicable to our score combination network because of the latent variables implicit in the interpretation. In fact, the additional complications posed by these latent variables currently limit us to the relatively simple pairwise ranking paradigm with a linear scoring function (Joachims, 2002). Direct optimization of listwise and setwise metrics in the presence of latent variables is left as future work. In the rest of this section, we will describe three approaches to train and deploy the score combination network.
4.4.1 Features for score combination
The score combination module shown at the bottom of Figure 4 uses a feature vector to describe the match between the query, each candidate query interpretation and each candidate answer entity. These features are informed by three role-differentiated convolutional networks (described in detail in the rest of this section). These are
- the query-corpus network (QCN) described in Section 4.1,
- the query-type network (QTN) described in Section 4.2, and
- the query-relation network (QRN) described in Section 4.3.
As in AQQU, we also include additional features such as entity tagger scores. Table 5 shows the complete list of features.
4.4.2 Single interpretation (AQQUCN-1)
In some data sets like SimpleQuestions (Bordes et al, 2015), each query, by construction, can be answered from the KG alone, using exactly one interpretation. This is largely true of WebQuestions (Berant et al, 2013) as well. As we report later, the post-facto single best (‘silver’, because the ‘gold’ interpretation is not provided) interpretation retrieves the gold entity set with accuracy much higher than any system. Therefore, all that remains is to try to infer the silver interpretation. This is exactly what AQQU attempts to do. AQQU first aggregates over all candidates to get a per-interpretation score, which is used to rank the interpretations and choose the top one. Training is provided by comparing the observed F1 scores of competing candidate interpretations. Our resulting system, AQQUCN-1, is similar to AQQU, except that we use convnets and draw on corpus information. The entity set retrieved by the single interpretation can then be sorted by decreasing score for ranking, if needed.
4.4.3 Allowing a limited number of interpretations (AQQUCN-FEW)
The assumption that a single interpretation can recall all relevant answer entities may not be valid in all situations. In particular, as we shall report later, interpretations derived from both KG and corpus can be necessary to cover the gold response entities.
In a set of candidate entities, if the score of each entity is determined by its best supporting interpretation, then the number of distinct interpretations supporting the candidate set may approach the size of the candidate set itself. In the next section, we will allow that to happen freely. In this section, we will take a small step to generalize one interpretation to a limited number of interpretations.
Suppose we are given a universe of available interpretations, from which we can admit only a subset of bounded size while scoring all the candidate entities. (We distinguish the number of top entities in the response to the user from the number of interpretations to be used internally.) The score of an entity is thereby restricted from the maximum over all available interpretations to the maximum over the admitted subset. We are given the set of gold (relevant) entities; the remaining candidates are called irrelevant. We then want the score of every relevant entity to exceed the score of every irrelevant entity by a margin, which is turned into a hinge loss over (relevant, irrelevant) pairs, where the margin is a hyperparameter and the hinge is the ReLU operator. Summarizing, the loss we seek to minimize during training is
[TABLE]
During inference, we do not know the gold entity set. Therefore we find
[TABLE]
and then sort candidate entities by decreasing score. Both expressions take time exponential in the number of admitted interpretations to evaluate, but we expect this number to be very small, usually under 3 (set heuristically). While we can try to optimize expression (1) directly to learn the model parameters inside the scoring function, the objective is highly nonconvex. We found it better to use the technique in the next section for training and use expression (2) for inference.
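To make the exponential cost concrete, the following sketch evaluates the pairwise hinge loss for every admitted subset of interpretations up to a size limit. It is an illustration under assumptions, not the authors' code: it takes a dense score matrix as a nested dictionary, assumes the training loss is minimized over admitted subsets, and all function and variable names are hypothetical.

```python
from itertools import combinations


def entity_score(scores, subset, entity):
    """Score of an entity restricted to the admitted interpretations."""
    return max(scores[interp][entity] for interp in subset)


def hinge_loss(scores, subset, relevant, irrelevant, margin=1.0):
    """Sum of hinge losses over all (relevant, irrelevant) entity pairs."""
    loss = 0.0
    for pos in relevant:
        for neg in irrelevant:
            gap = entity_score(scores, subset, pos) - entity_score(scores, subset, neg)
            loss += max(0.0, margin - gap)
    return loss


def best_subset(scores, relevant, irrelevant, max_size=2, margin=1.0):
    """Enumerate all subsets of at most max_size interpretations and
    return the one with the smallest loss (cost grows exponentially)."""
    interps = list(scores)
    subsets = [s for k in range(1, max_size + 1) for s in combinations(interps, k)]
    return min(subsets, key=lambda s: hinge_loss(scores, s, relevant, irrelevant, margin))


# scores[interpretation][entity]; gold entities a and b need different interpretations.
scores = {
    "kg": {"a": 2.0, "b": 0.0, "x": 0.5},
    "corpus": {"a": 0.0, "b": 2.0, "x": 0.5},
}
print(best_subset(scores, relevant=["a", "b"], irrelevant=["x"], max_size=2))
```

In this toy case neither interpretation alone separates both gold entities from the irrelevant one, while the pair of interpretations does, which is the situation that motivates admitting more than one interpretation.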
4.4.4 Allowing unlimited supporting interpretations (AQQUCN-ALL)
If a candidate entity is supported by multiple interpretations, a reasonable view is that the overall score of the entity is the score it receives from its best supporting interpretation, which induces a ranking among candidate entities. The set of gold entities is then used to define a loss and train the combination network. We use a pairwise loss, comparing, for a fixed query, a relevant entity with an irrelevant entity:
[TABLE]
where the margin is a hyperparameter. Note that the best supporting interpretation for the relevant entity may be different from the best supporting interpretation for the irrelevant entity.
- $Q$ is the set of queries, and $q$ is one query.
- $e^+$ is a relevant entity, $e^-$ is an irrelevant entity, for query $q$.
- $\phi(q, e_1, r, t_2, e_2)$ is the feature vector (see Section 4.4.1) representing an interpretation, composed of $e_1$ (one or more entities mentioned in query $q$), $r$ (relation mentioned or hinted at in $q$), $t_2$ (type mentioned or hinted at in $q$) and $e_2$ (candidate answer entity). $\phi$ incorporates inputs from the three convnets.
- $\xi$ is a vector of non-negative slack variables.
- $C$ is a balancing regularization parameter.
- $w$ is the weight vector to be learnt.
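A max-margin problem consistent with the quantities listed above can be written as follows. This is a hedged reconstruction, not the paper's typeset formula: the symbols $w$, $\phi$, $\xi$, $C$ and the margin $\Delta$, the shorthand $I = (e_1, r, t_2)$ for an interpretation, and the per-pair indexing of the slack variables are all assumptions made here for illustration.

```latex
\begin{aligned}
\min_{w,\; \xi \ge 0} \quad
  & \tfrac{1}{2} \lVert w \rVert_2^2
    + C \sum_{q \in Q} \sum_{e^+,\, e^-} \xi_{q, e^+, e^-} \\
\text{s.t.} \quad
  & \max_{I} \, w^\top \phi(q, I, e^+)
    \;-\; \max_{I'} \, w^\top \phi(q, I', e^-)
    \;\ge\; \Delta - \xi_{q, e^+, e^-}
  \qquad \forall\, q,\ e^+,\ e^-
\end{aligned}
```

Under this form, the maximum over interpretations of the relevant entity sits on the greater-than side of the constraint. A maximum of linear functions is convex in $w$, so requiring it to be large defines a nonconvex feasible set, whereas the subtracted maximum for the irrelevant entity causes no such difficulty.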
The max in the LHS of constraint (3) leads to nonconvexity, which we address by introducing auxiliary variables for each relevant candidate entity in the following optimization.
[TABLE]
For tractability, we relax the 0/1 constraint over the auxiliary variables to the continuous range [0, 1]:
[TABLE]
The relaxed solution does not correspond to any discrete interpretation; the relaxation is a device to make the optimization tractable. We obtain a local optimum for (4) by alternately updating the weight vector and the auxiliary variables. Each of these is a convex optimization problem and can be directly and efficiently solved. Figure 6 shows the pseudocode for inference in our proposed system. Through optimization (4), AQQUCN integrates query interpretation and entity response ranking into a unified framework, rather than a two-stage compile-and-execute strategy common in other QA systems, which effectively gambles on one best structured interpretation.
4.5 End-to-end vs. modular training
In recent years, end-to-end training of complex neural architectures has lost some appeal. Shalev-Shwartz and Shashua (2016) showed that the sample complexity of end-to-end training can be exponentially larger than the modular training of individual stages. Roth (2017) made similar arguments about some NLP tasks (also see Chapter 11 (End-to-end Deep Learning) of http://www.mlyearning.org/). With complex notions of ‘match’ (Figure 1), a commensurately complex network (Figure 4) to train, and comparatively few training instances that do not come with gold structured interpretations, we, too, chose modular training of QRN, QTN and QCN, followed by training the score combination network. It might be argued that the additional labeled data used to train individual modules renders unfair the competition between various systems. While there is some validity to this protest, many open-domain QA systems already use externally trained word embeddings (Bast and Haußmann, 2015; Yih et al, 2015; Xu et al, 2016), externally-trained target type recognizers (Murdock et al, 2012; Yavuz et al, 2016), and augmented training from SimpleQuestions (Bordes et al, 2015).
4.6 Set retrieval vs. ranking and the threshold module
Choosing and then ranking candidate entities is the most natural method to drive AQQUCN. In this mode, AQQUCN can be directly compared against any other entity ranking system, such as the one by Joshi et al (2014). On the other hand, comparing AQQUCN-FEW and AQQUCN-ALL with other systems that retrieve entity sets is not directly possible, unless the desired size of the entity set is specified, or the ranked list is somehow truncated. One way to approach this is to threshold the ranked list based on some criterion. We explore two thresholding strategies. In the first strategy, we set the score threshold to a fixed percentage of the top-ranked entity’s score (“relative threshold”), with the percentage tuned on held-out data. In the second strategy (which we refer to as “ideal threshold”), we threshold at the position that results in the best value of F1 that can be extracted from the ranking output by our system. As is obvious, the first one provides unfair advantage to existing KBQA systems, whereas the second provides unfair advantage to our system and merely provides an idealized, non-constructive upper bound on F1.
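The two strategies can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes positive scores, and the fraction used in the example is arbitrary rather than the tuned value.

```python
def f1(predicted, gold):
    """Set-based F1 between predicted and gold entity sets."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def relative_threshold(ranked, fraction):
    """Keep entities scoring at least `fraction` of the top score.
    ranked: list of (entity, score), sorted by decreasing score."""
    if not ranked:
        return []
    cutoff = fraction * ranked[0][1]
    return [entity for entity, score in ranked if score >= cutoff]


def ideal_threshold(ranked, gold):
    """Cut the ranked list at the position that maximizes F1.
    Needs the gold set, so it is only an upper bound, not a usable system."""
    entities = [entity for entity, _ in ranked]
    best_k = max(range(len(entities) + 1), key=lambda k: f1(entities[:k], gold))
    return entities[:best_k]


ranked = [("a", 1.0), ("x", 0.9), ("b", 0.8), ("y", 0.1)]
gold = {"a", "b"}
print(relative_threshold(ranked, 0.5), ideal_threshold(ranked, gold))
```

Because `ideal_threshold` consults the gold set, it cannot be deployed; it only bounds how much F1 the ranking itself could deliver with a perfect cut-off.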
5 Experiments and results
5.1 Testbed
KG:
We used Freebase (Bollacker et al, 2008) as the KG, specifically, the OpenLink Virtuoso snapshot provided with AQQU. It provides 2.9 billion relation facts on 44 million entities. The set of answer types, with about 4976 member types, is also curated and provided with AQQU. It should be possible to adapt other KGs such as WikiData for use with AQQU and AQQUCN.
Annotated corpus:
We used ClueWeb09B (ClueWeb09, 2009) as the corpus. It has about 50 million Web documents in WARC format, same as the rest of ClueWeb09 and ClueWeb12 (over 500 million documents each). Any corpus in WARC format can be used with AQQUCN, assuming it has useful entity annotations. For reproducibility, we used the public FACC1 entity annotations released by Google (Gabrilovich et al, 2013). The typical document has 13–15 entities annotated. Given enough computational capacity, one can run an entity tagger like TagMe (Ferragina and Scaiella, 2010) over the corpus. The number of tokens in each corpus snippet was limited to 20, based on hyperparameter tuning in early versions of the system. We also verified that minor changes to snippet length did not change the results noticeably.
Query sets:
Joshi et al (2014) provided syntax-poor translations (CSAW, 2018) of syntax-rich queries from TREC and INEX question answering competitions, as well as a fraction of syntax-rich queries from WebQuestions (Berant et al, 2013). This gave us four query sets summarized in Table 6 and called TREC-INEX-KW, TREC-INEX, WebQuestions-KW and WebQuestions.
By design, all WebQuestions queries can be answered using the Freebase KG. In contrast, only 57% of TREC-INEX queries can be answered from the KG alone under the restriction that the query entity and the answer entity lie within two hops of each other. Thus corpus evidence is important for TREC-INEX.
Convnet training protocol:
Data used to train the convnets is available (CSAW, 2018). Some important design choices are described below.
QTN and QRN:
Initial word vectors are learnt using the CNN-non-static version of Kim (2014). Filter sizes are set to 3 and 4 with 150 feature maps each. Drop-out rate is 0.5, with 100 epochs and early stopping using a validation split of 10%. Training is done through stochastic gradient descent over shuffled mini-batches with the Adadelta update.
QCN:
The width of the convolution filters is set to 5, the number of convolutional feature maps to 150, and the batch size to 50. An L2 regularization term is applied, with separate coefficients for the parameters of the convolutional layers and for all the others. The dropout rate is set to 0.5. We initialize the word vectors using the word embeddings trained by Huang et al (2012).
Score combination network training protocol:
All KG paths emanating from a query entity can contribute to the candidate answer set. We use the pruning process of Bast and Haußmann (2015) to restrict the set of KG queries (and hence the candidate set) to a practical size. We normalize all feature values in the feature vector before sending them to the score combination stage.
Evaluation protocol:
We evaluate entity ranking using measures common in Information Retrieval, applied to the response list of entities: mean average precision (MAP), mean reciprocal rank (MRR), and normalized discounted cumulative gain at rank 10 (NDCG@10). For ranking evaluation, the threshold module of Section 4.6 is not applied. For set retrieval evaluation, the system output entity set is created by thresholding the ranked list. It is then compared against the gold set to compute recall, precision and F1.
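For reference, the per-query ranking measures can be computed as in the sketch below. It uses the standard textbook definitions with binary relevance, which is an assumption about the exact variant used here; MAP and MRR are then the means of the per-query values over a query set. The function names are hypothetical.

```python
import math


def average_precision(ranked, gold):
    """Average of precision@rank over the ranks where a gold entity appears."""
    gold = set(gold)
    hits, total = 0, 0.0
    for rank, entity in enumerate(ranked, start=1):
        if entity in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold entity, or 0 if none is retrieved."""
    gold = set(gold)
    for rank, entity in enumerate(ranked, start=1):
        if entity in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gold, k=10):
    """NDCG@k with binary gains and a log2(rank + 1) discount."""
    gold = set(gold)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, entity in enumerate(ranked[:k], start=1) if entity in gold)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked, gold = ["x", "a", "b"], {"a", "b"}
print(average_precision(ranked, gold), reciprocal_rank(ranked, gold), ndcg_at_k(ranked, gold))
```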
5.2 QRN and QTN heatmaps
To understand the workings of QTN and QRN, we used the public implementation (https://github.com/marcotcr/lime) of Local Interpretable Model-Agnostic Explanations (LIME) (Ribeiro et al, 2016). LIME linearly approximates a neural model’s behavior in the vicinity of a particular instance to detect the sensitivity of a label decision to input features. Figure 7 illustrates various sentences, their top predicted classes, and the sensitivity to each query word (positive, neutral, or negative) in predicting that class. The observed polarities and intensities were generally intuitive.
5.3 Entity ranking comparison
The vast majority of KBQA papers report set retrieval accuracy in terms of recall, precision and F1. To our knowledge, only Joshi et al (2014) report entity ranking accuracy. We compare AQQUCN against their system in Table 7, using standard ranking performance measures: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). In Table 7, we see 5–16% absolute improvement in MAP over various query sets. The improvement is statistically significant. Ablation studies in Section 5.6 suggest some causes for the improvements.
5.4 Effect of number of interpretations allowed
Figure 8 shows interpolated precision against recall, without thresholding, for the variants AQQUCN-1, AQQUCN-FEW and AQQUCN-ALL. The trends on the two datasets are opposites: on TREC-INEX, performance increases with the number of interpretations, but on WebQuestions, allowing more interpretations reduces F1. This makes sense, because 85% of WebQuestions queries can be answered using a single relation (Yao, 2015) and without corpus support. AQQUCN-FEW restricts the number of interpretations and limits the damage.
For AQQUCN-1, note that features (2–26) in Table 5 are unique to an interpretation. Consequently, when an entity has no support from QCN, particularly when the entity is rare or absent in the corpus, the scores of entities retrieved by an interpretation are all the same because of identical feature vectors. This situation is not so rare that we can ignore its effects. In such cases, even if AQQUCN-1 retrieves a set of reasonable quality, its ranking results from arbitrary tie-breaking. AQQUCN-ALL and AQQUCN-FEW are largely immune to this problem, because ties are less likely among the entity scores.
Table 8 reports F1 scores for AQQUCN-1, AQQUCN-FEW and AQQUCN-ALL, after applying thresholding. The trends are similar to those shown in Figure 8. WebQuestions has somewhat larger gold entity sets on average (natural queries: 2.4; telegraphic: 2.1), but KG-based interpretations are frequently adequate. Each KG-based interpretation covers more entities, and therefore relatively fewer interpretations are needed to cover the gold set. In contrast, TREC-INEX has smaller gold entity sets on average (natural queries: 1.5; telegraphic: 1.4). But TREC-INEX depends on corpus-based interpretations, each of which typically covers only one entity. Therefore, TREC-INEX needs more interpretations to get good F1 scores.
Given the almost exclusive focus on WebQuestions and SimpleQuestions in prior work, these important considerations were discovered only after instrumenting AQQUCN. Hereafter, we will refer to the best performing variant among AQQUCN-1, AQQUCN-FEW and AQQUCN-ALL as AQQUCN-Best.
5.5 Entity set retrieval comparison
In Table 9, we compare F1 of entity set retrieval across several KBQA systems (CodaLab, 2016). (For three cases, only AQQU (Bast and Haußmann, 2015) and Sempre (Berant and Liang, 2015) code were available. Text2KB is available at https://github.com/DenXX/aqqu, but with missing corpus files and no format specification.) AQQUCN is presented with both relative threshold (AQQUCN-Best) and ideal threshold, to separate the quality of ranking vs. thresholding. Also quoted is the F1 score achievable in principle if the single best interpretation is used from KG and corpus.
The first striking observation on Table 9 is that, for TREC-INEX, using a single interpretation is a terrible plan; both AQQUCN-Best and AQQUCN-ALL with ideal threshold are far better. Predictably, without access to a corpus, Berant and Liang (2015) and Bast and Haußmann (2015) perform poorly. In sharp contrast, owing to the nature of WebQuestions, a clairvoyant choice of the single best interpretation beats everything else (including Savenkov and Agichtein (2016) with ideal threshold) by a very wide margin. While AQQUCN with ideal threshold far exceeds all other systems, it is not even close to the best single interpretation. In terms of non-clairvoyant achievable accuracies, AQQUCN-Best is visibly best for WebQuestions-KW, and ranked second for WebQuestions. Clearly a great deal of ground has been covered since 2013 for WebQuestions, yet there is plenty of room to improve. It is also clear that AQQUCN is much better at ranking than thresholding.
5.6 Ablation tests
To understand the contributions of the three convnets to the score combination network, we removed each network in turn, re-trained the best performing model, and tabulated the resulting F1 scores in Table 10. TREC-INEX (in both query forms) suffers a serious hit if QCN is removed. This makes sense because of the critical evidence brought in by corpus snippets in case of TREC-INEX. The effect of removing QCN is smaller for WebQuestions, but still visible, showing that, even if gold entities are in the KG, corpus evidence can help score them better.
The interplay between QRN and QTN is more nuanced. The relation involved in a query may assert strong selectional preferences on the types of participating entities. Therefore, an accurate relation predicted by the QRN can often make up for mistakes made in predicting the type by the QTN. Conversely, accurately predicting the type may mitigate a misleading choice of relation. Overall, though, Table 10 shows at least some performance reduction if either QRN or QTN is removed. For the terse queries in WebQuestions-KW, removing QTN hurts more than removing QCN.
5.7 Wins and losses
We performed a side-by-side analysis of a sample of queries for which we found our system to be doing better and worse than related work. Our system improved on some queries containing qualifiers such as ‘first’, ‘oldest’, since we harness signals from the text corpus. For example, Who was the first U.S. president ever to resign? can be translated to a complex graph query involving a max/sort over dates, making it difficult to interpret it using only the knowledge graph. Yih et al (2015) handled some of these queries using extensively hand-engineered features. However, such information was readily found in the Web corpus (examples in Figure 2). The corpus also helped when the KG was incomplete (e.g., president sworn on airplane) or for answering queries with no clear (e.g., which kennedy died first?).
We performed worse on some queries, especially when the corpus signal added more noise than information and our type or relation CNNs were not able to narrow down to the correct answer. Some of these queries had a non-trivial syntactic structure, possibly not captured by the corpus. One example is the query What nation is home to the Kaaba? At times, high annotation density in the corpus promoted popular non-answer entities over not-so-popular answer entities. For the query creator of the daily show, Jon_Stewart ranked above Madeleine_Smithberg, purely based on corpus popularity. Such cases highlight an opportunity to improve our corpus, type and relation CNNs.
6 Conclusion
We presented AQQUCN, a system that unifies structured interpretation of queries with ranking of response entities. Apart from seamlessly integrating corpus and KG information, AQQUCN has two salient features: it can deal with the full spectrum of query styles between keyword queries and well-formed questions; and it directly ranks response entities, rather than ‘compile’ the input to a structured query and execute that on the KG alone.
Acknowledgment:
Thanks to the reviewers for their constructive suggestions. Thanks to Elmar Haußmann for generous help with AQQU. Thanks to Doug Oard for advice on set vs. ranked retrieval. Thanks to Saurabh Sarda for migrating the code of Joshi et al (2014) to use AQQU. Partly supported by grants from IBM and nVidia.
References
- Andreas J, Rohrbach M, Darrell T, Klein D (2016) Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, URL https://arxiv.org/pdf/1601.01705.pdf
- Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473, URL http://arxiv.org/abs/1409.0473
- Balog K, Azzopardi L, de Rijke M (2006) Formal models for expert finding in enterprise corpora. In: SIGIR Conference, pp 43–50, DOI http://doi.acm.org/10.1145/1148170.1148181, URL http://staff.science.uva.nl/~kbalog/files/sigir2006-expertsearch.pdf
- Balog K, Azzopardi L, de Rijke M (2009) A language modeling framework for expert finding. Information Processing and Management 45(1):1–19, DOI http://dx.doi.org/10.1016/j.ipm.2008.06.003
- Bast H, Buchhold B (2017) QLever: A query engine for efficient SPARQL+text search. In: CIKM, pp 647–656, URL https://github.com/ad-freiburg/QLever
- Bast H, Haußmann E (2015) More accurate question answering on Freebase. In: CIKM, pp 1431–1440, URL http://ad-publications.informatik.uni-freiburg.de/CIKM_freebase_qa_BH_2015.pdf
- Berant J, Liang P (2015) Imitation learning of agenda-based semantic parsers. TACL 3:545–558, URL https://www.transacl.org/ojs/index.php/tacl/article/viewFile/646/160
- Berant J, Chou A, Frostig R, Liang P (2013) Semantic parsing on Freebase from question-answer pairs. In: EMNLP Conference, pp 1533–1544, URL http://aclweb.org/anthology//D/D13/D13-1160.pdf
