TiFi: Taxonomy Induction for Fictional Domains [Extended version]
Cuong Xuan Chu, Simon Razniewski, Gerhard Weikum

TL;DR
TiFi is a novel method for constructing accurate taxonomies for fictional domains from noisy sources, outperforming existing approaches and enabling structured knowledge bases in poorly covered areas.
Contribution
TiFi introduces a three-phase process for taxonomy induction tailored to fictional domains, including category cleaning, edge cleaning, and top-level mapping, with high precision results.
Findings
Successfully constructs taxonomies for diverse fictional domains.
Outperforms state-of-the-art taxonomy induction methods.
Achieves high precision in noisy, domain-specific data.
| Universe | # Categories | # Edges |
|---|---|---|
| Lord of the Rings (LoTR) | 973 | 1118 |
| Game of Thrones (GoT) | 672 | 1027 |
| Star Wars | 11012 | 14092 |
| Simpsons | 2275 | 4027 |
| World of Warcraft | 8249 | 11403 |
| Greek Mythology | 601 | 411 |
| Method | Universe | Precision | Recall | F1-score |
|---|---|---|---|---|
| Pasca 2018 (Pasca, 2018) | LoTR | 0.33 | 0.75 | 0.46 |
| | GoT | 0.57 | 0.85 | 0.68 |
| Ponzetto & Strube 2011 (Ponzetto and Strube, 2011) | LoTR | 0.44 | 1.0 | 0.61 |
| | GoT | 0.45 | 1.0 | 0.62 |
| Pasca + Ponzetto & Strube | LoTR | 0.41 | 0.75 | 0.53 |
| | GoT | 0.64 | 0.85 | 0.73 |
| TiFi | LoTR | 0.84 | 0.82 | 0.83 |
| | GoT | 0.85 | 0.85 | 0.85 |
| Train | Test | Precision | Recall | F1-score |
|---|---|---|---|---|
| LoTR | GoT | 0.81 | 0.85 | 0.83 |
| GoT | LoTR | 0.64 | 0.88 | 0.74 |
| LoTR | Star Wars | 0.63 | 0.94 | 0.75 |
| LoTR | Simpsons | 0.91 | 0.63 | 0.74 |
| LoTR | World of Warcraft | 0.95 | 0.63 | 0.75 |
| LoTR | Greek Mythology | 0.86 | 0.6 | 0.71 |
| Train | Test | Precision | Recall | F1-score | MAP |
|---|---|---|---|---|---|
| LoTR | GoT | 0.81 | 0.79 | 0.80 | 0.92 |
| GoT | LoTR | 0.89 | 0.87 | 0.88 | 0.89 |
| GoT | Star Wars | 0.92 | 0.92 | 0.92 | 0.91 |
| GoT | Simpsons | 0.86 | 0.86 | 0.86 | 0.92 |
| GoT | World of Warcraft | 0.72 | 0.71 | 0.72 | 0.76 |
| GoT | Greek Mythology | 0.92 | 0.92 | 0.92 | 0.92 |
| Method | Universe | Proper-name edges: Precision | Proper-name edges: Recall | Proper-name edges: F1-score | Concept edges: Precision | Concept edges: Recall | Concept edges: F1-score |
|---|---|---|---|---|---|---|---|
| HyperVec (Nguyen et al., 2017) | LoTR | 0.88 | 0.59 | 0.71 | 0.80 | 0.88 | 0.84 |
| | GoT | 1.0 | 0.16 | 0.27 | 0.83 | 0.9 | 0.87 |
| HEAD (Gupta et al., 2016a) | LoTR | 0.91 | 0.74 | 0.81 | 0.83 | 0.87 | 0.85 |
| | GoT | 0.72 | 0.68 | 0.70 | 0.82 | 0.8 | 0.81 |
| TiFi | LoTR | 0.92 | 0.79 | 0.85 | 0.88 | 0.89 | 0.88 |
| | GoT | 0.96 | 0.68 | 0.8 | 0.90 | 0.91 | 0.91 |
| Universe | #New Types | #New Edges | Precision |
|---|---|---|---|
| LoTR | 43 | 171 | 0.84 |
| GoT | 39 | 179 | 0.84 |
| Star Wars | 373 | 3387 | 0.84 |
| Simpsons | 115 | 439 | 0.92 |
| World of Warcraft | 257 | 2248 | 0.84 |
| Greek Mythology | 22 | 76 | 0.84 |
| Universe | # Types | # Edges | Precision |
|---|---|---|---|
| LoTR | 353 | 648 | 0.88 |
| Game of Thrones | 292 | 497 | 0.83 |
| Star Wars | 7352 | 12282 | 0.90 |
| Simpsons | 1029 | 2171 | 0.88 |
| World of Warcraft | 4063 | 7882 | 0.76 |
| Greek Mythology | 139 | 313 | 0.91 |
[TABLE: header cells Text, Structured Sources, Query, Wikia, Wikia-categories, TiFi; first query “Dragons in LOTR”; surviving cell contents “Black Númenórean” and “Shelob, Great Spiders”; remaining cells lost in extraction.]
| Method | Universe | Precision | Recall | F1-score |
|---|---|---|---|---|
| HEAD (Gupta et al., 2016a) | LoTR | 0.27 | 0.05 | 0.09 |
| | Simpsons | 0.31 | 0.09 | 0.14 |
| TiFi | LoTR | 0.79 | 0.55 | 0.62 |
| | Simpsons | 0.61 | 0.32 | 0.42 |
| Text | Structured Sources | |||
| Query | Wikia | Wikia-categories | TiFi | |
| 2 (52%) | 7 (65%) | 10 (62%) | 8 (87%) | |
| 1 (23%) | 2 (11%) | 8 (40%) | 3 (70%) | |
| 1 (20%) | 4 (36%) | 8 (63%) | 6 (79%) | |
| Average | 1 (32%) | 4 (37%) | 9 (55%) | 6 (79%) |
Cuong Xuan Chu, Simon Razniewski and Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
(2019)
Abstract.
Taxonomies are important building blocks of structured knowledge bases, and their construction from text sources and Wikipedia has received much attention. In this paper we focus on the construction of taxonomies for fictional domains, using noisy category systems from fan wikis or text extraction as input. Such fictional domains are archetypes of entity universes that are poorly covered by Wikipedia; other examples are enterprise-specific knowledge bases and highly specialized verticals. Our fiction-targeted approach, called TiFi, consists of three phases: (i) category cleaning, by identifying candidate categories that truly represent classes in the domain of interest, (ii) edge cleaning, by selecting subcategory relationships that correspond to class subsumption, and (iii) top-level construction, by mapping classes onto a subset of high-level WordNet categories. A comprehensive evaluation shows that TiFi is able to construct taxonomies for a diverse range of fictional domains such as Lord of the Rings, The Simpsons or Greek Mythology with very high precision and that it outperforms state-of-the-art baselines for taxonomy induction by a substantial margin.
Keywords: Taxonomy Induction, Fictional Domain
Conference: The Web Conference 2019; 13–17 May, 2019; San Francisco, USA
1. Introduction
1.1. Motivation and Problem
Taxonomy Induction: Taxonomies, also known as type systems or class subsumption hierarchies, are an important resource for a variety of tasks related to text comprehension, such as information extraction, entity search or question answering. They represent structured knowledge about the subsumption of classes, for instance, that electric guitar players are rock musicians and that state governors are politicians. Taxonomies are a core piece of large knowledge graphs (KGs) such as DBpedia, Wikidata, YAGO and industrial KGs at Google, Microsoft Bing, Amazon, etc. When search engines receive user queries about classes of entities, they can often find answers by combining instances of taxonomic classes. For example, a query about “left-handed electric guitar players” can be answered by intersecting the classes left-handed people, guitar players and rock musicians; a query about “actors who became politicians” can include instances from the intersection of state governors and movie stars such as Schwarzenegger. Also, taxonomic class systems are very useful for type-checking answer candidates for semantic search and question answering (Kalyanpur et al., 2011).
Taxonomies can be hand-crafted, examples being WordNet (Fellbaum and Miller, 1998), SUMO (Niles and Pease, 2001) or MeSH and UMLS (Bodenreider, 2004), or automatically constructed by taxonomy induction from textual or semi-structured cues about type instances and subtype relations. Methods for the latter include text mining using Hearst patterns (Hearst, 1992) or bootstrapped with Hearst patterns (e.g., (Wu et al., 2012)), harvesting and learning from Wikipedia categories as a noisy seed network (e.g., (Ponzetto and Strube, 2007; Ponzetto and Navigli, 2009; Ponzetto and Strube, 2011; Suchanek et al., 2007; de Melo and Weikum, 2010; Flati et al., 2014; Gupta et al., 2016a; Wu et al., 2008)), and inducing type hierarchies from query-and-click logs (e.g., (Pasca and Durme, 2007; Pasca, 2013; Gupta et al., 2014)).
The Case for Fictional Domains: Fiction and fantasy are a core part of human culture, spanning from traditional literature to movies, TV series and video games. Well-known fictional domains are, for instance, Greek mythology, the Mahabharata, Tolkien’s Middle-earth, the world of Harry Potter, or the Simpsons. These universes contain many hundreds or even thousands of entities and types, and are the subject of search-engine queries, by fans as well as cultural analysts. For example, fans may query about Muggles who are students of the House of Gryffindor (within the Harry Potter universe). Analysts may be interested in understanding character relationships (Iyyer et al., 2016; Bamman et al., 2014; Srivastava et al., 2016), learning story patterns (Chambers and Jurafsky, 2009; Chaturvedi et al., 2017) or investigating gender bias in different cultures (Agarwal et al., 2015). Thus, organizing entities and classes from fictional domains into clean taxonomies (see example in Fig. 1) is of great value.
Challenges: While taxonomy construction for encyclopedic knowledge about the real world has received considerable attention already, taxonomy construction for fictional domains is a new problem that comes with specific challenges:
1. State-of-the-art methods for taxonomy induction make assumptions on entity-class and subclass relations that are often invalid for fictional domains. For example, they assume that certain classes are disjoint (e.g., living beings and abstract entities, the oracle of Delphi being a counterexample). Also, assumptions about the surface forms of entity names (e.g., on person names: with or without first name, starting with Mr., Mrs., Dr., etc.) and typical phrases for classes (e.g., noun phrases in plural form) do not apply to fictional domains.
2. Prior methods for taxonomy induction intensively leveraged Wikipedia categories, either as a content source or for distant supervision. However, the coverage of fiction and fantasy in Wikipedia is very limited, and their categories are fairly ad-hoc. For example, Lord Voldemort is in categories like Fictional cult leaders (i.e., people), J.K. Rowling characters (i.e., a meta-category) and Narcissism in fiction (i.e., an abstraction). And whereas Harry Potter is reasonably covered in Wikipedia, fan websites feature many more characters and domains such as House of Cards (a TV series) or Hyperion Cantos (a 4-volume science fiction book) that are hardly captured in Wikipedia.
3. Both Wikipedia and other content sources like fan-community forums cover an ad-hoc mixture of in-domain and out-of-domain entities and types. For example, they discuss both the fictional characters (e.g., Lord Voldemort) and the actors of movies (e.g., Ralph Fiennes) and other aspects of the film-making or book-writing.
The same difficulties arise also when constructing enterprise-specific taxonomies from highly heterogeneous and noisy contents, or when organizing types for highly specialized verticals such as medieval history, the Maya culture, neurodegenerative diseases, or nano-technology material science. Methodology for tackling such domains is badly missing. We believe that our approach to fictional domains has great potential for being carried over to such real-life settings. This paper focuses on fiction and fantasy, though, where raw content sources are publicly available.
1.2. Approach and Contribution
In this paper we develop the first taxonomy construction method specifically geared for fictional domains. We refer to our method as the TiFi system, for Taxonomy induction for Fiction. We address Challenge 1 by developing a classifier for categories and subcategory relationships that combines rule-based lexical and numerical contextual features. This technique is able to deal with difficult cases arising from non-standard entity names and class names. Challenge 2 is addressed by tapping into fan community Wikis (e.g., harrypotter.wikia.com). This allows us to overcome the limitations of Wikipedia. Finally, Challenge 3 is addressed by constructing a supervised classifier for distinguishing in-domain vs. out-of-domain types, using a feature model specifically designed for fictional domains.
Moreover, we integrate our taxonomies with an upper-level taxonomy provided by WordNet, for generalizations and abstract classes. This adds value for searching by entities and classes. Our method outperforms the state-of-the-art taxonomy induction system for the first two steps, HEAD (Gupta et al., 2016a), by 21-23 and 6-8 percentage points in F1-score, respectively. An extrinsic evaluation based on entity search shows the value that can be derived from our taxonomies, where, for different queries, our taxonomies return answers with 24% higher precision than the input category systems. Along with the code of the TiFi system, we will publish taxonomies for 6 fictional universes.
2. Related Work
Text Analysis and Fiction
Analysis and interpretation of fictional texts are an important part of cultural and language research, both for the intrinsic interest in understanding themes and creativity (Chambers and Jurafsky, 2009; Chaturvedi et al., 2017), and for extrinsic reasons such as predicting human behaviour (Fast et al., 2016) or measuring discrimination (Agarwal et al., 2015). Other recurrent topics are, for instance, to discover character relationships (Iyyer et al., 2016; Bamman et al., 2014; Srivastava et al., 2016), to model social networks (Elangovan and Eisenstein, 2015; Bamman et al., 2014), or to describe personalities and emotions (Elson et al., 2010; Jhavar and Mirza, 2018). Such analysis traditionally required extensive manual reading; automated NLP techniques have recently led to the emergence of a new interdisciplinary subject called Digital Humanities, which combines methodologies and techniques from sociology, linguistics and computational sciences towards the large-scale analysis of digital artifacts and heritage.
Taxonomy Induction from Text
Taxonomies, that is, structured hierarchies of classes within a domain of interest, are a basic building block for knowledge organization and text processing, and crucially needed in tasks such as entity detection and linking, fact extraction, or question answering. A seminal contribution towards their automated construction was the discovery of Hearst patterns (Hearst, 1992), simple syntactic patterns like “X is a Y” that achieve remarkable precision, and are conceptually still part of many advanced approaches. Subsequent works aim to automate the process of discovering useful patterns (Snow et al., 2005; Roller and Erk, 2016). Recent work by Gupta et al. (Gupta et al., 2017) uses seed terms in combination with a probabilistic model to extract hypernym subsequences, which are then put into a directed graph from which the final taxonomy is induced by using a minimum cost flow algorithm. Other approaches utilize distributional representations of types (Roller et al., 2014; Yu et al., 2015; Nguyen et al., 2017; Vu and Shwartz, 2018), or aim to learn them pairwise (Yu et al., 2015) or hierarchically (Nguyen et al., 2017).
Taxonomy Construction using Wikipedia
A popular structured source for taxonomy induction is the Wikipedia category network (WCN). The WCN is a collaboratively constructed network of categories with many similarities to taxonomies, expressing for instance that the category Italian 19th century composers is a subcategory of Italian Composers. One project, WikiTaxonomy (Ponzetto and Strube, 2007, 2011), aims to classify subcategory relations in the WCN as subclass and not-subclass relations. They investigate heuristics based on lexical matching between categories, lexico-syntactic patterns and the structure of the category network for that purpose. YAGO (Suchanek et al., 2007; Hoffart et al., 2013) uses a very simple criterion to decide whether a category represents a class, namely to check whether it is in plural form. It also provides linking to WordNet (Fellbaum and Miller, 1998) categories, choosing in case of ambiguity simply the meaning appearing topmost in WordNet. MENTA (de Melo and Weikum, 2010) learns a model to map Wikipedia categories to WordNet, with the goal of constructing a multilingual taxonomy over both. MENTA creates mean edges and subclass edges between categories and entities across languages, then uses Markov chains to rank edges and induce the final taxonomy. WiBi (Wikipedia Bitaxonomy) (Flati et al., 2014) proceeds in two steps: It first builds a taxonomy from Wikipedia pages by extracting lemmas from the first sentence of pages, and heuristically disambiguating them and linking them to others. In the second step, WiBi combines the page taxonomy and the original Wikipedia category network to induce the final taxonomy. The most recent effort working on taxonomy induction over Wikipedia is HEAD (Gupta et al., 2016a). HEAD exploits multiple lexical and structural rules towards classifying subcategory relations, and is judiciously tailored towards high-quality extraction from the WCN.
Domain-specific Taxonomies
TAXIFY is an unsupervised approach to domain-specific taxonomy construction from text (Alfarone and Davis, 2015). Relying on distributional semantics, TAXIFY creates subclass candidates, which in a second step are filtered based on a custom graph algorithm. Similarly, Liu et al. (Liu et al., 2012) construct domain-specific taxonomies from keyword phrases augmented with relative knowledge and contexts. Compared with taxonomy construction from structured resources, these text-based approaches usually deliver comparably flat taxonomies.
Fan Wikis
Fans are organizing content on fictional universes on a multitude of webspaces. Particularly relevant for our problem are fan Wikis, i.e., community-built web content constructed using generic Wiki frameworks. Some notable examples of such Wikis are tolkiengateway.net/wiki, with 12k articles, www.mariowiki.com with 21k articles, or en.brickimedia.org with 29k articles. Particularly relevant are also Wiki farms, like Wikia (www.wikia.com/fandom) and Gamepedia (www.gamepedia.com), which host Wikis for 380k and 2k different fictional universes, and have Alexa rank 49 and 340, respectively.
In these Wikis, like on Wikipedia, editors collaboratively create and curate content. These Wikis come with support for categories: The Lord of the Rings Wiki, for instance, has over 900 categories and over 1000 subcategory relationships, and the Star Wars Wiki has 11k and 14k, respectively. As on Wikipedia, these category networks do not represent clean taxonomies in the ontological sense; they contain, for instance, meta-categories such as 1980 films, or relations such as Death in Battle being a subcategory of Character.
3. Design Rationale and Overview
3.1. Design Space and Choices
Input: The input to the taxonomy induction problem is a set of entities, such as locations, characters and events, each with a description in the form of associated text or tags and categories. Entities with textual descriptions are easily available in many forums incl. Wikipedia, wikis of fan communities or scholarly collaborations, and other online media. Tags and categories, including some form of category hierarchy, are available in various kinds of wikis – typically in very noisy form, though, with a fair amount of uninformative and misleading connections. When such sites merely provide tags for entities, we can harness subsumptions between tags (e.g., simple association rules) to derive a folksonomy (see, e.g., (Hotho et al., 2006; Jäschke et al., 2007; Fang et al., 2016)) and use this as an initial category system. When only text is available, we can use Hearst patterns and other text-based techniques (Hearst, 1992; Sanderson and Croft, 1999; Cimiano et al., 2005) to generate categories and construct a subsumption-based tree.
Output: Starting with a noisy category tree or graph for a given set of entities, from a domain of interest, the goal of TiFi is to construct a clean taxonomy that preserves the valid and appropriate classes and their instance-of and subclass-of relationships but removes all invalid or misleading categories and connections. Formally, the output of TiFi is a directed acyclic graph (DAG) with vertices and edges such that (i) non-leaf vertices are semantic classes relevant for the domain, (ii) leaf vertices are entities, (iii) edges between leaves and their parents denote which entities belong to which classes, (iv) edges among non-leaf vertices denote subclass-of relationships.
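The structural conditions on the output can be checked mechanically. The following is a minimal sketch, assuming the taxonomy is stored as a mapping from each vertex to the set of its parents; the function name `is_valid_taxonomy` and the toy vertices are illustrative and not part of TiFi.

```python
from graphlib import CycleError, TopologicalSorter

def is_valid_taxonomy(parents, classes, entities):
    """Check a taxonomy given as vertex -> set of parent vertices.

    classes and entities are the (hypothetical) labels of the vertices.
    """
    vertices = set(parents) | {p for ps in parents.values() for p in ps}
    # Every vertex is either a class or an entity, never both.
    if vertices != classes | entities or classes & entities:
        return False
    # Entities must be leaves: no edge may point to an entity as parent.
    if any(p in entities for ps in parents.values() for p in ps):
        return False
    try:
        # static_order() raises CycleError if the graph contains a cycle.
        list(TopologicalSorter(parents).static_order())
    except CycleError:
        return False
    return True
```

With this representation, an edge from an entity to a class reads as instance-of, and an edge between two classes reads as subclass-of.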
There is a wealth of prior literature on taxonomy induction methods, and the design space for going about fictitious and other non-standard domains has many options. Our design decisions are driven by three overarching considerations:
We leverage whatever input information is available, even if it comes with a high degree of noise. That is, when an online community provides categories, we use them. When there are only tags or merely textual descriptions, we first build an initial category system using folksonomy construction methods and/or Hearst patterns.
For the output taxonomy, we prioritize precision over recall. So our methods mostly focus on removing invalid vertices and edges. Moreover, to make classes for fictitious domains more interpretable and support cross-domain comparisons (e.g., for search), we aim to align the domain-specific classes with appropriate upper-level classes from a general-purpose ontology, using WordNet (Fellbaum and Miller, 1998). For example, dragons in Lord of the Rings should be linked to the proper WordNet sense of dragons, which then tells us that this is a subclass of mythical creatures.
It may seem tempting to cast the problem into an end-to-end machine-learning task. However, this would require sufficient training data in the form of pairs of input datasets and gold-standard output taxonomies. Such training data is not available, and would be hard and expensive to acquire. Instead, we break the overall task down into focused steps at the granularity of individual vertices and individual edges of category graphs. At this level, it is much easier to acquire labeled training data, by crowdsourcing (e.g., mturk). Moreover, we can more easily devise features that capture both local and global contexts, and we can harness external assets like dictionaries and embeddings.
3.2. TiFi Architecture
Based on the above considerations, we approach taxonomy induction in three steps, (1) category cleaning, (2) edge cleaning, (3) top-level construction. The architecture of TiFi is depicted in Fig. 2. Fig. 3 illustrates how TiFi constructs a taxonomy.
The first step, category cleaning (Section 4), aims to clean the original set of categories by identifying categories that truly represent classes within the domain of interest, and by removing categories that represent, for instance, meta-categories used for community or Wikia coordination, or concern topics outside of the fictional domain, like movie or video game adaptions, award wins, and similar. Previous work has tackled this step via syntactic and lexical rules (Suchanek et al., 2007; Pasca, 2018; Ponzetto and Strube, 2007). While such custom-tailored rules can achieve high accuracy, they have limitations w.r.t. applicability across domains. We thus opt for a supervised classification approach that combines rules from above with additional graph-based features. This way, taxonomy construction for a new domain only requires new training examples instead of new rules. Moreover, our experiments show that, to a reasonable extent, models can be reused across domains.
The second step, edge cleaning (Section 5), identifies the edges from the original category network that truly represent subcategory relationships. Here, both rule-based (Gupta et al., 2016b; Ponzetto and Strube, 2007) and embedding-based approaches (Nguyen et al., 2017) appear in the literature. Each approach has its strengths. However, rules again have limitations w.r.t. applicability across domains, while embeddings may disregard useful syntactic features, and crucially rely on enough textual content for learning. We thus again opt for a supervised approach, allowing us to combine existing lexical and embedding-based approaches with various adapted semantic and novel graph-based features.
For the third step, top-level construction (Section 6), the basic choices are to construct the top levels of taxonomies from input category networks (Ponzetto and Strube, 2007; Gupta et al., 2016b), or to reuse existing abstract taxonomies such as WordNet (Suchanek et al., 2007). As fan Wikis (and even Wikipedia) generally have a comparably small coverage of abstract classes, we here opt for the reuse of the existing WordNet top-level classes. This also comes with the additional advantage of establishing a shared vocabulary across domains, allowing one to query, for instance, for animal species appearing both in LoTR and GoT (with answers such as dragons).
4. Category Cleaning
In the first step, we aim to select the categories from the input that actually represent classes in the domain of interest. There are several reasons why a category would not satisfy this criterion, including the following:
Meta-categories: Wiki platforms typically introduce metacategories related to administration and technical setup, e.g., Meta or Administration.
Contextual categories: Community Wikis usually also contain information about the production of the universes (e.g., inspirations or actors), about the reception (e.g., awards), and about remakes and adaptions, which do not relate to the real content of the universes.
Instances: Editors frequently create categories that are actually instances (e.g., Arda or Mordor in The Lord of the Rings).
Extensions: Wikis sometimes also contain fan-made extensions of universes that are not universally agreed upon.
Previous works on Wikipedia remove either only meta-categories or instances by using crafted lexical rules (Ponzetto and Strube, 2007, 2011; Pasca, 2018). As our setting has to deal with a wider range of noise, we instead choose the use of supervised classification. We use a logistic regression classifier with binary (0/1) lexical and integer graph-based features, as detailed next.
A. Lexical Features
Meta-categories: True if a category’s name contains one of 22 manually selected strings, such as wiki, template, user, portal, disambiguation, articles, administration, file, pages, etc.
Plural categories: True if the headword of a category is in plural form. We use shallow parsing to extract headwords, for instance, identifying the plural term Servants in Servants of Morgoth, a strong indicator for a class.
Capitalization: True if a category starts with a capital letter. We introduced this feature as we observed that in fiction, lowercase categories frequently represent non-classes.
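The three lexical features can be sketched as follows. This is a rough illustration, not the TiFi implementation: the preposition-based `headword` helper stands in for shallow parsing, the plural test is a naive suffix check, and `META_STRINGS` contains only the nine strings named in the text, not the full list of 22.

```python
# Only the nine strings named in the text; the full list has 22 entries.
META_STRINGS = ("wiki", "template", "user", "portal", "disambiguation",
                "articles", "administration", "file", "pages")
PREPOSITIONS = {"of", "in", "from", "by", "at", "on", "for"}

def headword(category):
    """Crude stand-in for shallow parsing: the token before the first preposition."""
    tokens = category.split()
    for i, tok in enumerate(tokens):
        if i > 0 and tok.lower() in PREPOSITIONS:
            return tokens[i - 1]
    return tokens[-1]

def lexical_features(category):
    """Return the binary (0/1) lexical features for a category name."""
    name = category.lower()
    head = headword(category).lower()
    return {
        "meta": int(any(s in name for s in META_STRINGS)),
        # Naive plural test; a real system would use a lemmatizer.
        "plural": int(head.endswith("s") and not head.endswith("ss")),
        "capitalized": int(category[:1].isupper()),
    }
```

For Servants of Morgoth, the sketch extracts the headword Servants and sets the plural and capitalization features to 1.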
B. Graph-based Features
Instance count: The number of direct instances of a category.
Supercategory/subcategory count: The number of super/subcategories of a category, e.g., 0/2 for Characters in Fig. 3 (left). Categories with more instances, superclasses or subclasses have potentially more relevance.
Average depth: Average upward path length from a category. Categories with short upward paths are potentially less likely to be relevant.
Connected subgraph size: The maximal size of connected subgraphs which a given category belongs to. Each connected subgraph is extracted by using depth first search on each root of the input category network. Meta-categories are sometimes disconnected from the core classes of a universe.
While the first two are established features, all other features have been newly designed to especially meet the characteristics of fiction. As we show in Section 7, this varied feature set allows us to identify in-domain classes with 83%-85% precision.
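The graph-based features can be computed directly from the category network. The sketch below assumes an acyclic network given as a mapping from each category to its set of parents, reads "average depth" as the mean length of all upward paths to a root, and uses toy category names; these readings are assumptions, not details confirmed by the text.

```python
from collections import defaultdict

def graph_features(cat, parents, instances):
    """Graph-based features for one category.

    parents: category -> set of parent categories (assumed acyclic).
    instances: category -> set of direct instances.
    """
    children = defaultdict(set)
    for child, ps in parents.items():
        for p in ps:
            children[p].add(child)
    cats = set(parents) | set(children)
    roots = [c for c in cats if not parents.get(c)]

    def upward_lengths(c):
        # Lengths of all upward paths from c to a root.
        ps = parents.get(c, ())
        return [n + 1 for p in ps for n in upward_lengths(p)] or [0]

    def reachable(root):
        # Depth-first search downwards from a root.
        seen, stack = set(), [root]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(children.get(c, ()))
        return seen

    lengths = upward_lengths(cat)
    return {
        "instances": len(instances.get(cat, ())),
        "supercategories": len(parents.get(cat, ())),
        "subcategories": len(children.get(cat, ())),
        "avg_depth": sum(lengths) / len(lengths),
        "subgraph_size": max(len(s) for s in map(reachable, roots) if cat in s),
    }
```

The resulting values, together with the lexical 0/1 features, can then be passed as a feature vector to any logistic regression implementation, for example scikit-learn's `LogisticRegression`.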
5. Edge Cleaning
Once the categories that represent classes in the domain of interest have been identified, the next task is to identify which subcategory relationships also represent subclass relationships. While most previous works rely on rules (Ponzetto and Strube, 2007; de Melo and Weikum, 2010; Flati et al., 2014; Gupta et al., 2016a), these are again too inflexible for the diversity of fictional universes. We thus tackle the task using supervised learning, relying on a combination of syntactic, semantic and graph-based features for a regression model.
A. Syntactic Features
Head Word Matching
Head word matching is arguably the most popular feature for taxonomy induction. Categories sharing the same headword, for instance Realms and Dwarven Realms, are natural candidates for hypernym relationships.
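We use shallow parsing to extract, for a category $c$, its headword $\mathit{head}(c)$, its prefix $\mathit{pre}(c)$, and its suffix (postfix) $\mathit{pos}(c)$, that is, $c = \mathit{pre}(c) + \mathit{head}(c) + \mathit{pos}(c)$. Consider a subcategory pair $(c_1, c_2)$, in which $c_1$ is a subcategory of $c_2$:
1. If $\mathit{head}(c_1) + \mathit{pos}(c_1) = \mathit{head}(c_2) + \mathit{pos}(c_2)$, and $\mathit{pre}(c_2) \subseteq \mathit{pre}(c_1)$ then $c_2$ is a superclass of $c_1$.
2. If $\mathit{pre}(c_1) + \mathit{head}(c_1) = \mathit{pre}(c_2) + \mathit{head}(c_2)$, and $\mathit{pos}(c_2) \subseteq \mathit{pos}(c_1)$ then $c_2$ is a superclass of $c_1$.
3. If $\mathit{head}(c_1) \neq \mathit{head}(c_2)$ and $\mathit{head}(c_2) \subseteq \mathit{pre}(c_1)$ or $\mathit{head}(c_2) \subseteq \mathit{pos}(c_1)$ then there is no subclass relationship between $c_1$ and $c_2$.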
Case (1) covers the example of Realms and Dwarven Realms, while case (2) allows us to infer, for instance, that Elves is a superclass of Elves of Gondolin. Case (3) allows us to infer that certain categories are not superclasses of each other, e.g., Gondor and Lords of Gondor. The subclass and no-subclass inferences are each implemented as a binary 0/1 feature.
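The three cases can be sketched in code. The exact conditions below are an assumption inferred from the three examples (Realms / Dwarven Realms, Elves / Elves of Gondolin, Gondor / Lords of Gondor), and `split_category` is a crude preposition-based stand-in for shallow parsing.

```python
PREPOSITIONS = {"of", "in", "from", "by", "at", "on", "for"}

def split_category(name):
    """Return (prefix tokens, headword, postfix tokens) of a non-empty category name."""
    tokens = name.lower().split()
    cut = next((i for i, t in enumerate(tokens) if i > 0 and t in PREPOSITIONS),
               len(tokens))
    return tokens[:cut - 1], tokens[cut - 1], tokens[cut:]

def head_word_features(child, parent):
    """Return (subclass, no_subclass) as 0/1 features for a subcategory pair."""
    pre_c, head_c, pos_c = split_category(child)
    pre_p, head_p, pos_p = split_category(parent)
    same_head = head_c == head_p
    # Case 1: same head and postfix, parent prefix contained in child prefix.
    case1 = same_head and pos_c == pos_p and set(pre_p) <= set(pre_c)
    # Case 2: same prefix and head, parent postfix contained in child postfix.
    case2 = same_head and pre_c == pre_p and set(pos_p) <= set(pos_c)
    # Case 3: different heads, but the parent's head occurs among the child's modifiers.
    case3 = not same_head and head_p in pre_c + pos_c
    return int(case1 or case2), int(case3)
```

A pair that matches none of the cases, such as Dragons and Creatures, yields 0 for both features, so the classifier has to rely on the remaining features.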
Only Plural Parent
True if, for a subcategory pair $(c_1, c_2)$, $c_1$ has no parent categories other than $c_2$, and $c_2$ is in plural form (Gupta et al., 2016a).
B. Semantic Features
WordNet Hypernym Matching
WordNet is a carefully handcrafted lexical database that contains semantic relations between words and word senses (synsets), including hypo/hypernym relations. To leverage this resource, we map categories to WordNet synsets, using context-based similarity to identify the right word sense in the case of ambiguities. To compute the context vectors of categories, we extract their definitions, that is, the first sentence from the Wiki pages of the categories (if existing), and their parent and child class names. As context for WordNet synsets we use the definition (gloss) of each sense. We then compute cosine similarities over the resulting bags-of-words, and link each category with the position-adjusted most similar WordNet synset (see Alg. 1). Then, given a category $c_1$ and its parent category $c_2$, with linked WordNet synsets $s_1$ and $s_2$, respectively, this feature is true if $s_2$ is a WordNet hypernym of $s_1$.
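The sense-linking step can be illustrated with a bag-of-words cosine similarity. This is a simplified sketch: it picks the plain maximum and does not reproduce the position adjustment of Alg. 1, and the sense identifiers and glosses in the usage example are made-up stand-ins, not actual WordNet entries.

```python
import math
from collections import Counter

def bag(text):
    """Bag-of-words vector of a text, using whitespace tokenization."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link_category(context, glosses):
    """Pick the sense whose gloss is most similar to the category context.

    context: definition sentence plus parent and child class names.
    glosses: sense identifier -> gloss text.
    """
    ctx = bag(context)
    return max(glosses, key=lambda sense: cosine(ctx, bag(glosses[sense])))

# Illustrative senses only; not taken from WordNet.
glosses = {
    "dragon#1": "a creature of myth with wings and fiery breath",
    "dragon#2": "a fiercely vigilant and unpleasant woman",
}
best = link_category("a winged fire-breathing creature of Middle-earth creatures", glosses)
```

Here `best` is the first sense, because its gloss shares more tokens with the category context than the second one does.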
Wikidata Hypernym Matching
Similarly to WordNet, Wikidata also contains relations between entities. For example, Wikidata knows that Maiar is an instance (P31) of Middle-earth races in The Lord of the Rings. While Wikidata’s coverage is generally lower than that of WordNet, its content is sometimes complementary, as WordNet does not know certain concepts, e.g., Maiar.
Page Type Matching
One interesting contribution of the WiBi system (Flati et al., 2014) was to use the first sentence of Wikipedia pages to extract hypernyms. First sentences frequently define concepts, e.g., “The Haradrim, known in Westron as the Southrons and once as the “Swertings” by Hobbits, were a race of Men from Harad in the region of Middle-earth directly south of Gondor”. For categories having matching articles in the Wikis, we rely on the first sentence from these. We use the Stanford Parser (Manning et al., 2014) on the definition of the category to get a dependency tree. By extracting nsubj, compound and conj dependencies, we get a list of hypernyms for the category. For example, for Haradrim we can extract the relation nsubj(race-13, Haradrim-2), hence race is a hypernym of Haradrim. After getting hypernyms for a category, we link these hypernyms to classes in the taxonomies by using head word matching, and set this feature to true for any pair of categories linked this way.
WordNet Synset Description Type Matching
Similar to page type matching, we also extract superclass candidates from the description of the WordNet synset. For instance, given the WordNet description for Werewolves: “a monster able to change appearance from human to wolf and back again”, we can identify Monster as superclass.
Distributional Similarity
The distributional hypothesis states that similar words share similar contexts (Harris, 1954), and despite the subclass relation being asymmetric, symmetric similarity measures have been found to be useful for taxonomy construction (Shwartz et al., 2016). In this work, we utilize two distributional similarity measures, a symmetric one based on the structure of WordNet, and an asymmetric one based on word embeddings. The symmetric Wu-Palmer score compares the depth of two synsets (the headwords of the categories) with the depth of their least common subsumer (lcs) (Wu and Palmer, 1994). For synsets $s_1$ and $s_2$, it is computed as:

$$\text{WuPalmer}(s_1, s_2) = \frac{2 \cdot \text{depth}(\text{lcs}(s_1, s_2))}{\text{depth}(s_1) + \text{depth}(s_2)}$$
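A small worked example of the Wu-Palmer score, assuming the standard definition (twice the depth of the least common subsumer, divided by the sum of the two depths). The hierarchy is a hypothetical toy tree, not actual WordNet data, and depth is counted in nodes so that the root has depth 1.

```python
# Hypothetical toy hypernym tree (child -> parent); not actual WordNet data.
PARENT = {
    "entity": None,
    "animal": "entity",
    "bird": "animal",
    "eagle": "bird",
    "wolf": "animal",
}

def path_to_root(node):
    """List of nodes from the given node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def lcs(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors:
            return node
    raise ValueError("nodes do not share a root")

def wu_palmer(a, b):
    """Symmetric similarity in (0, 1]; equals 1 when a == b."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))

score = wu_palmer("eagle", "wolf")  # lcs is "animal": 2 * 2 / (4 + 3)
```

Because the measure is symmetric, it cannot by itself decide which of two categories is the parent, which is why it is paired with a directional score.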
The HyperVec score (Nguyen et al., 2017) not only shows the similarity between a category and its hypernym, but is also directional. Given categories $c_1$ and $c_2$, with stemmed head words $h_1$ and $h_2$, respectively, the HyperVec score is computed as:

$$\text{HyperVec}(c_1, c_2) = \cos(E_{h_1}, E_{h_2}) \cdot \frac{\lVert E_{h_2} \rVert}{\lVert E_{h_1} \rVert}$$

where $E_h$ is the embedding of word $h$. Specifically, we use Word2Vec (Mikolov et al., 2013) to train a distributional representation over Wikia documents. The term $\cos(\cdot, \cdot)$ represents the cosine similarity between two embeddings, and $\lVert \cdot \rVert$ the Euclidean norm of an embedding. While WordNet only captures similarity between general concepts, embedding-based measures can cover both conceptual and non-conceptual categories, as often needed in the fantasy domain (e.g., similarity between Valar and Maiar).
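The directional behaviour can be seen in a minimal sketch, assuming the score multiplies the cosine similarity of the two head word embeddings by the ratio of the candidate hypernym's norm to the candidate hyponym's norm. The two-dimensional vectors are hypothetical toy values, not trained embeddings.

```python
import math

def norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    return sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))

def hypervec_score(child_vec, parent_vec):
    """Cosine similarity scaled by the norm ratio parent / child (assumed form)."""
    return cosine(child_vec, parent_vec) * norm(parent_vec) / norm(child_vec)

# Hypothetical toy embeddings: the candidate parent has the larger norm.
child = [3.0, 4.0]    # norm 5
parent = [8.0, 6.0]   # norm 10

forward = hypervec_score(child, parent)   # 0.96 * 10 / 5
backward = hypervec_score(parent, child)  # 0.96 * 5 / 10
```

The cosine term is identical in both directions; only the norm ratio changes, so swapping the arguments yields a different score.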
C. Graph-based Features
Common Children Support
Absolute number of common children (categories and instances) of two given categories. Presumably, the more common children two categories have, the more related to each other they are.
Children Depth Ratio
The ratio between the number of child categories of the parent of the edge, and its average depth in the taxonomy. This feature models the generality of the parent candidate.
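Both graph-based features can be sketched over simple adjacency maps. The sketch assumes that depth is counted in nodes (roots have depth 1) and that the average is taken over all paths from the parent candidate to a root; the category names and links are hypothetical.

```python
def common_children(children, a, b):
    """Number of children (categories and instances) shared by a and b."""
    return len(children.get(a, set()) & children.get(b, set()))

def depths(parents, node):
    """Depth of node along every path to a root (roots have depth 1)."""
    direct = parents.get(node, [])
    if not direct:
        return [1]
    return [d + 1 for p in direct for d in depths(parents, p)]

def children_depth_ratio(child_categories, parents, parent):
    """Number of child categories of `parent` divided by its average depth."""
    all_depths = depths(parents, parent)
    average_depth = sum(all_depths) / len(all_depths)
    return len(child_categories.get(parent, set())) / average_depth

# Hypothetical toy data: children include both categories and instances.
children = {
    "Elves": {"Noldor", "Sindar", "Legolas"},
    "Immortals": {"Noldor", "Legolas", "Valar"},
}
# Hypothetical parent links; "Elves" is reachable from the root by two paths.
parents = {"Races": [], "Immortals": ["Races"], "Elves": ["Races", "Immortals"]}
child_categories = {"Elves": {"Noldor", "Sindar"}}

support = common_children(children, "Elves", "Immortals")
ratio = children_depth_ratio(child_categories, parents, "Elves")  # 2 / 2.5
```

A parent candidate with many child categories that sits close to the root gets a high ratio, matching the intuition that it is a general class.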
The features for edge cleaning combine existing state-of-the-art features (Head word matching, Page type matching, HyperVec) with adaptations specific to our domain (Wikidata hypernym matching, WordNet synset matching), and new graph-based features. Section 7 shows that this feature set allows TiFi to surpass the state-of-the-art in edge cleaning by 6-8% F1-score.
6. Top-level Construction
Category systems from Wiki sources often resemble forests rather than trees, i.e., they do not reach towards very general classes, and miss useful generalizations such as man-made structures or geographical features for fortresses and rivers. While works geared towards Wikipedia typically conclude with having identified classes and subclasses (Ponzetto and Strube, 2007, 2011; de Melo and Weikum, 2010; Flati et al., 2014; Gupta et al., 2016a), we aim to include generalizations and abstract classes consistently across universes. For this purpose, TiFi employs, as its third step, the integration of selected abstract WordNet classes. The integration proceeds in three steps:
1. Given the taxonomy constructed so far, nodes are linked to WordNet synsets using Algorithm 1. Where the linking is successful, WordNet hypernyms are then added as superclasses. For example, the category Birds is linked to the WordNet synset bird%1:05:00::, whose superclasses are wn_vertebrate, wn_chordate, wn_animal, wn_organism, wn_living_thing, wn_whole, wn_object, wn_physical_entity and wn_entity.
2. The added classes are then compressed by removing those that have only a single parent and a single child; for instance, abstract_entity and physical_entity in Fig. 3 (right) would be removed if they really had only one child.
3. We correct a few WordNet links that are not suited for the fictional domain, and use a self-built dictionary to remove 125 top-level WordNet synsets that are too abstract to add value, for instance, whole, sphere and imagination.
Note that the present step can add subclass relationships between existing classes. In Fig. 3, after edge filtering, there is no relation between Birds and Animals, while after linking to WordNet, the subclass relation between Birds and Animals is added, making the resulting taxonomy more dense and useful.
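The compression in the second integration step can be sketched as splicing out added classes that have exactly one parent and one child, reconnecting the child to the parent. The edge set is a hypothetical toy example, and the representation as (child, parent) pairs is an assumption made for illustration.

```python
def compress(edges, added):
    """Splice out added classes with exactly one parent and one child.

    edges: set of (child, parent) pairs.
    added: names of classes introduced from WordNet (only these may be removed).
    """
    edges = set(edges)
    added = set(added)
    changed = True
    while changed:
        changed = False
        for node in sorted(added):
            node_parents = [p for c, p in edges if c == node]
            node_children = [c for c, p in edges if p == node]
            if len(node_parents) == 1 and len(node_children) == 1:
                parent, child = node_parents[0], node_children[0]
                edges -= {(node, parent), (child, node)}
                edges.add((child, parent))  # reconnect child directly to parent
                added.discard(node)
                changed = True
                break  # edge set changed, so rescan from the start
    return edges

# Hypothetical toy input: two wiki categories below a chain of added classes.
edges = {
    ("Birds", "wn_vertebrate"),
    ("wn_vertebrate", "wn_chordate"),
    ("wn_chordate", "wn_animal"),
    ("Wolves", "wn_animal"),
    ("wn_animal", "wn_organism"),
}
added = {"wn_vertebrate", "wn_chordate", "wn_animal", "wn_organism"}
compressed = compress(edges, added)
```

In this toy case wn_chordate and wn_vertebrate are removed, while wn_animal is kept because it has two children.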
7. Evaluation
In this section we evaluate the performance of the individual steps of the TiFi approach, and the ability of the end-to-end system to build high-quality taxonomies.
Universes
We use 6 universes that cover fantasy (LoTR, GoT), science fiction (Star Wars), animated sitcom (Simpsons), video games (World of Warcraft) and mythology (Greek Mythology). For each of these, we extract their category networks from dump files of Wikia or Gamepedia. The sizes of the respective category networks, the input to TiFi, are shown in Table 1.
7.1. Step 1: Category Cleaning
Evaluation data for the first step was created using crowdsourcing, which was used to label all categories in LoTR and GoT, and a random sample of 50 categories from each of the other universes. Specifically, workers were asked to decide whether a given category had instances within the fictional domain of interest. We collected three opinions per category, and chose majority labels. Worker agreement was between 85% and 91%.
As baselines we employ a rule-based approach by Ponzetto & Strube (Ponzetto and Strube, 2011), to the best of our knowledge the best performing method for general category cleaning, and recent work by Marius Pasca (Pasca, 2018) that targets the aspect of separating classes from instances. Furthermore, we combine both methods into a joint filter. The results of training and testing on LoTR/GoT, respectively, each under 10-fold crossvalidation, are shown in Table 2. TiFi achieves both superior precision (+40%) and F1-score (+22%/+23%), while observing a smaller drop in recall (-18%/-15%). On both fully annotated universes the improvement of TiFi over the combined baseline in terms of F1-score is statistically significant (p-value and , respectively). The considerable difference in precision is explained largely by the limited coverage of the rule-based baseline. Typical errors TiFi still makes are cases where categories have the potential to be relevant, yet appear to have no instances, e.g., song in LoTR. Also, it occasionally misses out on conceptual categories which do not have plural forms, e.g., Food.
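For reference, the F1-score used throughout the evaluation is the harmonic mean of precision and recall. The short sketch below checks this against the Pasca (2018) row for LoTR in the results table, which reports precision 0.33 and recall 0.75; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from the Pasca (2018) LoTR row.
pasca_lotr_f1 = round(f1_score(0.33, 0.75), 2)
```

The rounded value agrees with the F1-score of 0.46 reported in that row.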
A characteristic of fiction is variety. As our approach requires labeled training data, a question is to what extent labeled data from one domain can be used for cleaning categories of another domain. We thus next evaluate the performance when applying the model trained on LoTR to the other 5 universes, and the model trained on GoT to LoTR. The results are shown in Table 3; note that for universes other than LoTR and GoT, only 50 samples were annotated. As one can see, F1-scores drop by only 9%/2% compared with same-domain training, and the F1-score is above 70% even for quite different domains.
To explore the contribution of each feature, we performed an ablation test using recursive feature elimination. The most important feature group was the lexical features (30%/10% F1-score drop if removed in LoTR/GoT), with plural form checking being the single most important feature. In contrast, removing the graph-based features led only to a 10%/0% drop, respectively.
7.2. Step 2: Edge Cleaning
We used crowdsourcing to label all edges that remained after cleaning noisy categories from LoTR and GoT, and a random sample of 100 edges in each of the other universes. For example, we asked Turkers whether, in LoTR, Uruk-hai are Orc Man Hybrids. Inter-annotator agreement was between 90% and 94%.
We compare with two state-of-the-art systems: (1) HEAD (Gupta et al., 2016a), the most recent system for Wikipedia category relationship cleaning, and (2) HyperVec (Nguyen et al., 2017), a recent embedding-based hypernym relationship learning system. The results for in-domain evaluation using 10-fold crossvalidation are shown in Table 4. As one can see, TiFi achieves a comparable precision (-2%/+2%), and a superior recall (+15%/+13%), resulting in a gain in F1-score of 6%/8%. Again, the F1-score improvement of TiFi over HyperVec and HEAD on the two fully annotated universes is statistically significant (p-values , , and , respectively).
To explore the scalability of TiFi, we again perform cross-domain experiments using 100 labeled edges per universe. The results are shown in Table 5. In all universes but World of Warcraft, TiFi achieves more than 80% F1-score, and the performance is further highlighted by mean average precision (MAP) scores above 89%, meaning TiFi can effectively separate correct from incorrect edges.
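To make the MAP figure concrete, the following is a minimal sketch of average precision over a list of edges ranked by classifier confidence, and its mean over several rankings. This is the standard definition; how exactly the rankings were formed for Table 5 is not stated here, so the grouping is an assumption and the label lists are hypothetical.

```python
def average_precision(ranked_labels):
    """Average precision for booleans ordered by decreasing confidence."""
    hits, total = 0, 0.0
    for rank, is_correct in enumerate(ranked_labels, start=1):
        if is_correct:
            hits += 1
            total += hits / rank  # precision at this correct item's rank
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the average precision over several rankings."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Hypothetical rankings: True marks a correct edge, False an incorrect one.
ranking_a = [True, True, False, True]   # (1/1 + 2/2 + 3/4) / 3
ranking_b = [False, True]               # (1/2) / 1
map_score = mean_average_precision([ranking_a, ranking_b])
```

A score close to 1 means that correct edges are consistently ranked above incorrect ones.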
As mentioned earlier, taxonomy induction on real-world domains can leverage a lot of semantic knowledge such as WordNet synsets, while fiction frequently contains non-standard categories such as Valar and Tatyar. We further evaluate the performance of TiFi by distinguishing two types of edges:
- Concept edges: Both parent and child exist in WordNet.
- Proper-name edges: At least one of parent and child does not exist in WordNet.
In The Lord of the Rings, there are 145 proper-name edges and 407 concept edges, while in Game of Thrones, there are 61 and 329, respectively. Table 6 reports the performance of TiFi compared to HEAD and HyperVec on both types of edges. As one can see, for proper-name edges, TiFi achieves a very high precision of 92%/96% and outperforms HEAD by 4%/10% and HyperVec by 14%/53% in F1-score, respectively.
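The split into concept and proper-name edges can be sketched as a simple membership check. The set standing in for a WordNet lookup and the example edges are hypothetical; a real check would query WordNet for each (normalized) category name.

```python
from collections import Counter

def edge_type(child, parent, known_to_wordnet):
    """Return 'concept' if both ends are known to WordNet, else 'proper-name'."""
    both_known = child in known_to_wordnet and parent in known_to_wordnet
    return "concept" if both_known else "proper-name"

# Hypothetical stand-in for a WordNet lookup.
known_to_wordnet = {"Birds", "Animals", "Races"}
# Hypothetical (child, parent) edges, for illustration only.
edges = [("Birds", "Animals"), ("Valar", "Races"), ("Tatyar", "Races")]

counts = Counter(edge_type(child, parent, known_to_wordnet) for child, parent in edges)
```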
We again performed an ablation test in order to understand feature contribution. We found that all three groups of features have importance, observing a 1-4% drop in F1-score when removing any of them. The individually most important features were Only Plural Parent, Headword Matching, Common Children Support and Page Type Matching.
7.3. Step 3: Top-level Construction
The key step in top-level construction is the linking of categories to WordNet synsets (i.e., category disambiguation), hence we only evaluate this step. For this purpose, in each universe, we randomly selected 50 such links and evaluated their correctness, finding precisions between 84% and 92% (see Table 7). Overall, this step is able to link 30-72% of top-level classes from Step 2, and adds between 22 and 373 WordNet classes and between 76 and 3387 subclass relationships to our universes.
7.4. Final Taxonomies
Table 8 summarizes the taxonomies constructed for our 6 universes, with the bottom 4 universes built using the models for GoT. Reported precisions refer to the weighted average of the precision of subclass edges from Step 2, and the precision of WordNet linking. Figure 4 shows the resulting taxonomy for Greek Mythology, rendered using the R layout fruchterman.reingold. All taxonomies will be made available both as CSV and graphically.
7.5. Wikipedia as Input
While our method is targeted towards fiction, it is also interesting to know how well it does in the traditional Wikipedia setting. To this end, we extracted a specific slice of Wikipedia, namely all categories that are subcategories of Desserts, resulting in 198 categories connected by 246 subcategory relations, which we fully labeled.
Using 10-fold crossvalidation, in the first step, category cleaning, our method achieves 99% precision and 99% recall, which puts it on par with Ponzetto & Strube (Ponzetto and Strube, 2011), which achieves 99% precision and 100% recall. The reason for the excellent performance of both systems is that noise in Wikipedia categories concerns fairly uniformly meta-categories, which can be well filtered by enumerating them. In the second step, edge cleaning, TiFi also achieves comparable results, with a slightly lower precision (83% vs. 87%) and a slightly higher recall (92% vs. 89%), resulting in 87% F1-score for TiFi vs. 88% for HEAD.
7.6. WebIsALOD as Input
WebIsALOD (Hertling and Paulheim, 2017) is a large collection of hypernymy relations extracted from the general web (Common Crawl). Relying largely on pattern-based extraction, the data from WebIsALOD is very noisy, especially beyond the top-confidence ranks. Because the source is text-based, several features based on category systems become unavailable, making it an ideal stress test for the TiFi approach.
Data:
To get data from WebIsALOD, we selected the 100 most popular entities from each of two universes, The Lord of the Rings and Simpsons, based on the frequency of their mentions in text. We then queried the hypernyms of these entities and took the top 3 hypernyms ranked by confidence score (minimum confidence 0.2). We iterated this procedure once with the newly gained hypernyms. In the end, we obtained 324 classes and 312 hypernym relations for The Lord of the Rings, and 271 classes and 228 hypernym relations for Simpsons. We manually labeled these datasets in full, checking whether classes are noisy and whether hypernym relations are wrong. According to the labels, only 217 classes (67%) and 167 classes (62%) should be kept in The Lord of the Rings and Simpsons, respectively, and only 42% and 47% of the hypernym relations are correct. These statistics confirm that the data from WebIsALOD is very noisy.
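The data collection procedure can be sketched as a two-round expansion: take the top 3 hypernyms with confidence of at least 0.2 for each seed entity, then repeat once for the newly gained hypernyms. The in-memory dictionary is a hypothetical stand-in for querying WebIsALOD, and the entries and confidence values are invented for illustration.

```python
def top_hypernyms(hypernyms, term, k=3, min_conf=0.2):
    """Top-k hypernyms of `term` by confidence, dropping those below min_conf."""
    candidates = [(h, conf) for h, conf in hypernyms.get(term, []) if conf >= min_conf]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [h for h, _ in candidates[:k]]

def collect(hypernyms, seeds, rounds=2):
    """Query the seeds, then iterate once more over newly gained hypernyms."""
    classes, relations = set(), set()
    frontier = list(seeds)
    for _ in range(rounds):
        newly_gained = []
        for term in frontier:
            for hypernym in top_hypernyms(hypernyms, term):
                relations.add((term, hypernym))
                if hypernym not in classes:
                    classes.add(hypernym)
                    newly_gained.append(hypernym)
        frontier = newly_gained
    return classes, relations

# Hypothetical stand-in for WebIsALOD: term -> [(hypernym, confidence), ...].
hypernyms = {
    "Gandalf": [("wizard", 0.9), ("character", 0.7), ("thing", 0.1),
                ("istari", 0.5), ("hero", 0.3)],
    "wizard": [("magician", 0.6), ("person", 0.4)],
    "character": [("person", 0.8)],
}
classes, relations = collect(hypernyms, ["Gandalf"])
```

In this toy run, "hero" is cut by the top-3 limit and "thing" by the confidence threshold, leaving five classes and six hypernym relations.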
Results:
In Step 1, Ponzetto & Strube (Ponzetto and Strube, 2011) use lexical rules to remove meta-categories, while Pasca (Pasca, 2018) uses heuristics based on information extracted from Wikipedia pages to detect entities that are classes. To enable comparison with Pasca’s work, we used exact lexical matches to link classes from WebIsALOD to Wikipedia page titles, then used the Wikipedia pages as inputs. In fact, classes from WebIsALOD are hardly ever meta-categories, and the additional data from Wikipedia is also quite noisy. Table 9 shows that TiFi still performs very well in category cleaning, and significantly outperforms the baselines by 10%/20% F1-score.
In Step 2, HEAD uses heuristics to clean hypernym relations between classes, mostly based on lexical features and information from class pages (e.g., Wikipedia pages). Although TiFi also uses the information from class pages, its supervised model additionally uses a set of other features and is thus more versatile. Table 10 reports the results of TiFi compared with HEAD in edge cleaning, with TiFi outperforming HEAD by 28%-53% F1-score.
Both steps were also evaluated in the cross-domain settings, with similar results (90%/91% F1-score in step 1, 53%/55% F1-score in step 2).
8. Use Case: Entity Search
To highlight the usefulness of our taxonomies, we provide an extrinsic evaluation based on the use case of entity search. Entity search is a standard problem in information retrieval, where textual queries are expected to return lists of matching entities. In the following, we focus on the retrieval of correct entities only, and disregard the ranking aspect.
Setup
We consider three universes, The Lord of the Rings, Simpsons and Greek Mythology, and manually generate 90 text queries belonging to the following categories (10 of each per universe):
1. Single type: Entities belonging to a class, e.g., Orcs in The Lord of the Rings;
2. Type intersection: Entities belonging to two classes, e.g., Humans that are agents of Saruman;
3. Type difference: Entities that belong to one class but not another, e.g., Spiders that are not servants of Sauron.
We utilize the following resources:
- Unstructured resources: (1) Google Web Search and (2) the Wikia-internal text search function;
- Structured resources: (3) the Wikia category networks and (4) the taxonomies as built by TiFi.
Evaluation
For the unstructured resources, we manually checked the titles of the top 10 returned pages for correctness.
For the structured resources, we matched the classes in the query against all classes in the taxonomy that contained those class names as substrings. We then computed, in a breadth-first manner, all subclasses and all instances of these classes, truncating the latter to at most 10 answers, and manually verified whether returned instances were correct or not.
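The lookup over structured resources can be sketched as substring matching of class names followed by a breadth-first traversal of subclasses, with intersection and difference applied on the resulting instance lists. The taxonomy, the instance names and the choice to truncate after combining the lists are assumptions made for illustration.

```python
from collections import deque

def instances_of(query_class, subclasses, instances):
    """Instances of all classes whose name contains query_class, found breadth-first."""
    all_classes = set(subclasses) | set(instances)
    all_classes |= {c for subs in subclasses.values() for c in subs}
    start = sorted(c for c in all_classes if query_class.lower() in c.lower())
    queue, seen, found = deque(start), set(start), []
    while queue:
        cls = queue.popleft()
        for inst in instances.get(cls, []):
            if inst not in found:
                found.append(inst)
        for sub in subclasses.get(cls, []):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return found

def answer(subclasses, instances, all_of, none_of=(), limit=10):
    """Entities in every class of all_of (at least one) and in no class of none_of."""
    result = instances_of(all_of[0], subclasses, instances)
    for cls in all_of[1:]:
        other = set(instances_of(cls, subclasses, instances))
        result = [e for e in result if e in other]
    for cls in none_of:
        banned = set(instances_of(cls, subclasses, instances))
        result = [e for e in result if e not in banned]
    return result[:limit]  # assumption: truncate after combining

# Hypothetical toy taxonomy with placeholder instance names.
subclasses = {"Spiders": ["Giant spiders"]}
instances = {
    "Giant spiders": ["spider_a", "spider_b"],
    "Servants of Sauron": ["spider_a", "orc_c"],
}
single = answer(subclasses, instances, ["Spiders"])
both = answer(subclasses, instances, ["Spiders", "Servants of Sauron"])
difference = answer(subclasses, instances, ["Spiders"], none_of=["Servants of Sauron"])
```

Set operations over class memberships are what make conjunction and negation straightforward here, in contrast to keyword search.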
Results
Table 11 reports for each resource the average number of results and their precision. We find that Google performs worst, mainly because its diversification is limited (it often returns distinct answers only far down in the ranking), and because it cannot process conjunction and negation well. Wikia performs better in terms of answer size, as by design it contains each entity only once. Still, it struggles with logical connectors. The Wikia categories produce more results than TiFi (9 vs. 6 on average), though due to noise, they yield a substantially lower precision (-24%). This corresponds to the core of the TiFi approach, which in steps 1 and 2 performs cleaning, i.e., it lowers recall while increasing precision.
Table 12 lists three sample queries along with their output. Crossed-out entities are incorrect answers. As one can see, text search mostly fails in answering the queries that use boolean connectives, while the original Wikia categories are competitive in terms of the number of answers, but produce many more wrong answers.
9. Conclusion
In this paper we have introduced TiFi, a system for taxonomy induction for fictional domains. TiFi uses a three-step architecture with category cleaning, edge cleaning, and top-level construction, thus building holistic domain specific taxonomies that are consistently of higher quality than what the Wikipedia-oriented state-of-the-art could produce.
Unlike most previous work, our approach is not based on static rules, but uses supervised learning. This comes with the advantage of allowing classes and edges to be ranked, for instance, in order to distinguish between core elements, less or marginally relevant ones, and totally irrelevant ones. In turn it also necessitates the generation of training data, yet we have shown that training data can be reasonably reused across domains.
Mirroring earlier experiences of YAGO (Suchanek et al., 2007), it also turns out that a crucial step in building useful taxonomies is the incorporation of abstract classes. For TiFi we relied on the established WordNet hierarchy, nevertheless finding the need to adapt a few links, and to remove certain too abstract concepts.
So far we only applied our system to fictional domains and one slice of Wikipedia. In the future, we would like to explore the construction of more domain-specific but real-world taxonomies, such as gardening, Maya culture or Formula 1 racing.
Code and taxonomies will be made available on GitHub.
References
- Agarwal et al. (2015) Apoorv Agarwal, Jiehan Zheng, Shruti Kamath, Sriramkumar Balasubramanian, and Shirin Ann Dey. 2015. Key Female Characters in Film have more to talk about besides men: Automating the Bechdel Test. In NAACL. 830–840.
- Alfarone and Davis (2015) Daniele Alfarone and Jesse Davis. 2015. Unsupervised Learning of an is-a Taxonomy from a Limited Domain-specific Corpus. In IJCAI. 1434–1441.
- Bamman et al. (2014) David Bamman, Brendan O’Connor, and Noah A Smith. 2014. Learning Latent Personas of Film Characters. In ACL. 352.
- Bodenreider (2004) Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): Integrating Biomedical Terminology. Nucleic acids research (2004), D267–D270.
- Chambers and Jurafsky (2009) Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised Learning of Narrative Schemas and their Participants. In ACL/IJCNLP. 602–610.
- Chaturvedi et al. (2017) Snigdha Chaturvedi, Mohit Iyyer, and Hal Daumé III. 2017. Unsupervised Learning of Evolving Relationships Between Literary Characters. In AAAI. 3159–3165.
- Cimiano et al. (2005) Philipp Cimiano, Andreas Hotho, and Steffen Staab. 2005. Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis. J. Artif. Intell. Res. 24 (2005), 305–339.
