Orometric Methods in Bounded Metric Data

Maximilian Stubbemann; Tom Hanika; Gerd Stumme

arXiv:1907.09239·cs.AI·October 27, 2020


TL;DR

This paper introduces a novel method that applies orometric measures, originally used for topographic analysis, to bounded metric data in knowledge graphs, enabling the identification of significant items such as geographic entities.

Contribution

It transfers orometric valuation functions to metric data in knowledge graphs and demonstrates their effectiveness in item ranking and recommendation tasks.

Findings

01

Effective identification of relevant geographic entities in Wikidata.

02

Orometric measures improve item ranking accuracy in supervised learning.

03

Method generalizes topographic analysis to metric data in knowledge graphs.


Tables

Table 1. Basic statistics of the country datasets extracted from Wikidata.

Country	Municipalities	University Locations
France	2063	92
Germany	2863	164

Table 2. Confusion matrix.

	Predicted Negative	Predicted Positive
Actual Negative	TN (True Negative)	FP (False Positive)
Actual Positive	FN (False Negative)	TP (True Positive)

Equations

$$\operatorname{iso}(m)\coloneqq\inf\{d(m,n)\mid n\in M\setminus\{m\}\wedge h(n)\geq h(m)\}.$$

$$E_{\delta}\coloneqq\left\{\{m,n\}\in{M\choose{2}}\mid d(m,n)\leq\delta\right\}$$

$$\operatorname{prom}_{G}(v)\coloneqq\min\{h(v),\operatorname{mindesc}_{G}(v)\}$$

$$\delta_{M}\coloneqq\sup\{\inf\{d(m,n)\mid n\in M\setminus\{m\}\}\mid m\in M\}.$$

$$\lim_{\delta\searrow\delta_{M}}\operatorname{prom}_{\delta}(m)$$

$$\operatorname{prom}_{(.)}(m):\ ]\delta_{M},\hat{\delta}[\ \to\mathbb{R},\quad\delta\mapsto\operatorname{prom}_{\delta}(m).$$

$$\operatorname{prom}(m)\coloneqq\lim_{\delta\searrow\delta_{M}}\operatorname{prom}_{\delta}(m).$$


Code & Models

Repositories

mstubbemann/Orometric-Methods-in-Bounded-Metric-Data (official)


Full text


Maximilian Stubbemann

L3S Research Center and University of Kassel, Kassel, Germany

Tom Hanika

Berlin School of Library and Information Science, Humboldt University of Berlin, Berlin, Germany

Gerd Stumme

University of Kassel and L3S Research Center, Kassel, Germany

(2019)

Abstract.

A large amount of data accommodated in knowledge graphs (KG) is actually metric. For example, the Wikidata KG contains a plenitude of metric facts about geographic entities like cities, chemical compounds or celestial objects. In this paper, we propose a novel approach that transfers orometric (topographic) measures to bounded metric spaces. While these methods were originally designed to identify relevant mountain peaks on the surface of the earth, we demonstrate a notion to use them for metric data sets in general, notably metric sets of items enclosed in knowledge graphs. Based on this we present a method for identifying outstanding items using the transferred valuation functions ’isolation’ and ’prominence’. Building on this we imagine an item recommendation process. To demonstrate the relevance of the novel valuations for such processes we use item sets from the Wikidata knowledge graph. We then evaluate the usefulness of ’isolation’ and ’prominence’ empirically in a supervised machine learning setting. In particular, we find structurally relevant items in the geographic population distributions of Germany and France.

Keywords: metric spaces, orometric functions, knowledge graphs, classification

CCS Concepts: Information systems → Novelty in information retrieval; Mathematics of computing → Graph algorithms; Mathematics of computing → Graphs and surfaces; Computing methodologies → Heuristic function construction; Computing methodologies → Semantic networks

1. Introduction

Knowledge graphs, such as DBpedia (Lehmann et al., 2015) or Wikidata (Vrandečić and Krötzsch, 2014), are the state-of-the-art structures for storing information and drawing knowledge from it. They are knowledge bases represented as graphs and consist essentially of items which are related through properties and values. This enables them to fulfill the task of giving exact answers to exact questions. However, they are limited when it comes to providing a concise overview of the contained metric data and giving characteristic insights. The number of such metric data sets in Wikidata is tremendous. For example, since (presumably) the set of all cities of the world, including their geographic coordinates, is included in Wikidata, this constitutes a metric data set. Further examples are chemical compounds and their physical properties like mass and size, or celestial bodies and their trajectories.

One possibility to enhance the understanding of the metric data is to identify outstanding elements, i.e., outstanding items. Based on such elements it is possible to compose or enhance item recommendations to users. For example, such recommendations could provide a set of the most relevant cities in the world with respect to being outstanding in their local surroundings. However, it is a challenging task to identify outstanding items in metric data sets. In cases where the metric space is equipped with an additional valuation function, this task becomes more feasible. Such functions, often called scores or height functions, are frequently naturally provided: cities may be ranked by their population; the importance of scientific publications may be ranked by the $h$-index (Hirsch, 2005) of their corresponding authors. A naïve approach for recommending relevant items in such settings would be based on the claim that items with higher scores are more relevant. While this method seems reasonable for many applications, obstacles arise if the “highest” items of the topic are concentrated in a specific region of the underlying metric data space. For example, returning the twenty most populated cities in the world as an overview of the city landscape would return no European city (https://en.wikipedia.org/wiki/List_of_largest_cities, accessed 2019-06-16), and recommending the hundred highest mountain peaks of the world would not lead to any knowledge about the mountainscapes outside of Asia (https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth, accessed 2019-06-16).

To overcome this problem we propose a novel approach: we combine the valuation measure (e.g., “height”) and distances drawn from the metric in order to provide new valuation functions on the set of items, called prominence and isolation. In contrast to the naïve approach, those functions value an item based on its height in relation to the valuations of the surrounding items. This results in a valuation function on the set of items that reflects the extent to which an item is locally outstanding. The basic idea behind the novel valuation functions is the following. The prominence function values an item based on the minimal descent (with respect to the height function) that is needed to get to another point of at least the same height. Furthermore, the isolation function, sometimes also called dominance radius, values the distance to the next higher point with respect to the given metric and height function. These measures are adapted from the field of topography, where topographic isolation and topographic prominence are used in order to identify outstanding mountain peaks. Our approach is based on (Schmidt and Stumme, 2018), where the authors Schmidt & Stumme proposed prominence and dominance for networks. We will transfer and adapt these through generalization to the realm of bounded metric spaces.

To give a first insight into the potential of the novel valuation functions in knowledge graphs, we will empirically verify their ability to identify relevant items for a given topic. For this we employ a supervised machine learning task. We evaluate whether isolation and prominence functions can contribute to the task of identifying relevant items in the sets of French and German cities.

The contributions of this paper are as follows:

• We propose prominence and isolation for bounded metric spaces. For this we generalize the results in (Schmidt and Stumme, 2018) which were limited to finite, undirected graphs.

• We demonstrate an artificial machine learning task for evaluating novel valuation functions in metric data.

• We introduce a general approach for using prominence and isolation to enrich metric data in knowledge graphs. We show empirically that this information helps to identify a set of representative items.

The remainder of this paper is organized as follows. In Section 2 we give a short overview of related work. This is followed by Section 3, where the necessary mathematical foundation is laid out. Section 4 gives a first insight into how the novel valuation functions can be employed in a possible recommendation process. We evaluate this in Section 5 and conclude our work in Section 6.

2. Related Work

Item recommendation for knowledge graphs is a contemporary topic of high interest in research. Investigations cover for example music recommendation using content and collaborative information (Oramas et al., 2016) or movie recommendations using PageRank-like methods (Catherine and Cohen, 2016). The former is based on the common notion of embedding, i.e., embedding of the graph structure into $d$-dimensional $\mathbb{R}$ vector spaces. The latter operates on the relational structure itself. Our approach differs from those as it is based on combining a valuation measure with the metric of the data space. Nonetheless, given an embedding into a finite-dimensional $\mathbb{R}$ vector space, one could apply isolation and prominence in those as well.

The novel valuation functions prominence and isolation are inspired by topographic measures, which have their origin in the classification of mountain peaks. The idea of ranking peaks solely by their absolute height was already deprecated in 1987 by Fry in his work (Fry, 1987). The author introduced prominence for geographic mountains, a function still investigated in this realm, e.g., in Torres et al. (Torres et al., 2018), where the authors use deep learning methods to identify prominent mountain peaks. Another recent step for this was made in (Kirmse and de Ferranti, 2017), where the authors investigated methods for discovering new ultra-prominent mountains. Isolation and further valuation functions motivated in the orometric realm are collected in (Helman, 2005).

Recently the idea of transferring orometric functions to different realms of research gained attention: The authors of (Nelson and McKeon, 2019) used topographic prominence to identify population areas in several U.S. States. In (Schmidt and Stumme, 2018) the authors Schmidt & Stumme transferred prominence and dominance, i.e., isolation, to co-author graphs in order to evaluate their potential of identifying ACM Fellows. We build on this for proposing our valuation functions on bounded metric data. This generalization results in a wide range of applications.

3. Mathematical Modeling

Let us consider the following scenario: We have a data set $M$, consisting of a set of items, in the following called points, equipped with a metric $d$ and a valuation function $h$, in the following called height function. The goal of the orometric (topographic) measures prominence and isolation is to provide measures that reflect the extent to which a point is locally outstanding in its neighborhood.

Let $M$ be a non-empty set and $d:M\times M\to\mathbb{R}_{\geq 0}$ . We call $d$ a metric on the set $M$ iff

(1) $\forall x,y\in M:d(x,y)=0\iff x=y$ , and

(2) $d(x,y)=d(y,x)$ for all $x,y\in M$ , called symmetry, and

(3) $\forall x,y,z\in M:d(x,z)\leq d(x,y)+d(y,z)$ , called triangle inequality.

If $d$ is a metric on $M$ , we call $(M,d)$ a metric space and if $M$ is finite we call $(M,d)$ a finite metric space. If there exists a $C\in\mathbb{R}_{\geq 0}$ such that we have $d(m,n)\leq C$ for all $m,n\in M$ , we call $(M,d)$ bounded. For the rest of our work we assume that $|M|>1$ and $(M,d)$ is a bounded metric space. Additionally, we have that $M$ is equipped with a height function (valuation / score function) $h:M\to\mathbb{R}_{\geq 0},m\mapsto h(m)$ .
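For a finite data set, the three axioms above can be checked by brute force. The following is a minimal sketch, not part of the original work; the function names `is_metric` and `diameter`, the tolerance `tol` and the toy points are illustrative assumptions.

```python
from itertools import product


def is_metric(points, d, tol=1e-12):
    """Check the three metric axioms on a finite set of points by brute force."""
    for x, y in product(points, repeat=2):
        # (1) d(x, y) = 0 iff x = y
        if (d(x, y) <= tol) != (x == y):
            return False
        # (2) symmetry
        if abs(d(x, y) - d(y, x)) > tol:
            return False
    # (3) triangle inequality
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True


def diameter(points, d):
    """Smallest bound C with d(m, n) <= C; every finite metric space is bounded."""
    return max(d(m, n) for m in points for n in points)


# Hypothetical toy data: three points in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
squared = lambda p, q: euclid(p, q) ** 2  # violates the triangle inequality here
```

On this toy input the Euclidean distance passes the check, while the squared Euclidean distance fails it, since $100>25+25$.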

Definition 3.1 (Isolation).

Let $(M,d)$ be a bounded metric space and let $h:M\to\mathbb{R}_{\geq 0}$ be a height function on $M$. The isolation of a point $m\in M$ is then defined as follows:

• If there is no other point with at least equal height to $m$, then $\operatorname{iso}(m)\coloneqq\sup\{d(m,n)\mid n\in M\}$. The boundedness of $M$ guarantees the existence of this supremum.

• If there is at least one other point in $M$ with at least equal height to $m$, we define its isolation by:

$$\operatorname{iso}(m)\coloneqq\inf\{d(m,n)\mid n\in M\setminus\{m\}\wedge h(n)\geq h(m)\}.$$

The isolation of a mountain peak is often called the dominance radius or sometimes the dominance. Since the term orometric dominance of a mountain sometimes refers to the quotient of prominence and height, we will stick to the term isolation to avoid confusion.
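On a finite data set, both cases of the isolation definition reduce to a minimum or a maximum over pairwise distances. The following is a minimal brute-force sketch under that assumption; the function name `isolation` and the toy data (four points on a line with made-up heights) are illustrative and not taken from the paper.

```python
def isolation(points, d, h):
    """Map each point to its isolation on a finite set (brute force)."""
    iso = {}
    for m in points:
        # distances to all other points that are at least as high as m
        higher = [d(m, n) for n in points if n != m and h(n) >= h(m)]
        if higher:
            iso[m] = min(higher)
        else:
            # m is the unique highest point: use the largest distance from m
            iso[m] = max(d(m, n) for n in points)
    return iso


# Hypothetical toy data: positions on the real line and heights.
pos = {"a": 0.0, "b": 1.0, "c": 3.0, "d": 7.0}
height = {"a": 5, "b": 2, "c": 4, "d": 1}

iso = isolation(list(pos), lambda m, n: abs(pos[m] - pos[n]), height.get)
print(iso)  # {'a': 7.0, 'b': 1.0, 'c': 3.0, 'd': 4.0}
```

The highest point `a` receives the largest distance to any other point, whereas `c` is valued by its distance to `a`, the only point of at least equal height.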

While the isolation can be defined within the given setup, we have to equip our metric space with some more structure in order to transfer the notion of prominence. Informally, the prominence of a point is given by the minimal vertical distance one has to descend to get to a point of at least the same height. To adapt this measure to our given setup in metric spaces with a height function, we have to define what a path is. Structures that provide paths in a natural way are graph structures. For a given graph $G=(V,E)$ with vertex set $V$ and edge set $E\subseteq{V\choose{2}}$, walks are defined as sequences of nodes $\{v_{i}\}_{i=0}^{n}$ which satisfy $\{v_{i-1},v_{i}\}\in E$ for all $i\in\{1,...,n\}$. If we also have $v_{i}\neq v_{j}$ for $i\neq j$, we call such a sequence a path. For $v,w\in V$ we say $v$ and $w$ are connected iff there exists a path connecting them. Furthermore, we denote by $G(v)$ the connected component of $G$ containing $v$, i.e., $G(v)\coloneqq\{w\in V\mid v\ \text{is connected with}\ w\}$.

To use the prominence measure as introduced by Schmidt & Stumme in (Schmidt and Stumme, 2018), which is indeed defined on graphs, we have to derive an appropriate graph structure from our metric space.

The topic of graphs embedded in finite-dimensional vector spaces, so-called spatial networks (Barthélemy, 2011), is a topic of current interest. These networks appear in real world scenarios frequently, for example in the modeling of urban street networks (Jiang and Claramunt, 2004). Note that our setting, in contrast to the aforementioned, is not based on an a priori given graph structure. In our scenario the graph structure must be derived from the structure of the given metric space.

Our approach is to construct a step size graph or threshold graph, where we consider points in the metric space as nodes and connect two points through an edge iff their distance does not exceed a given threshold $\delta$.

Definition 3.2 ($\delta$-Step Graph).

Let $(M,d)$ be a metric space and $\delta>0$. We define the $\delta$-step graph or $\delta$-threshold graph, denoted by $G_{\delta}$, as the tuple $\left(M,E_{\delta}\right)$ via

$$E_{\delta}\coloneqq\left\{\{m,n\}\in{M\choose{2}}\mid d(m,n)\leq\delta\right\}.$$

This approach is similar to the one found in the realm of random geometric graphs, where it is common to define random graphs by placing points uniformly in the plane and connecting them via edges if their distance is less than a given threshold (Penrose, 2003).
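The $\delta$-step graph of a finite metric space can be built with a quadratic loop over all pairs of points. The sketch below is illustrative only: the function name `threshold_graph`, the adjacency-set representation and the toy data are assumptions, and an edge is added when $d(m,n)\leq\delta$.

```python
from itertools import combinations


def threshold_graph(points, d, delta):
    """Adjacency sets of the delta-step graph: {m, n} is an edge iff d(m, n) <= delta."""
    adj = {m: set() for m in points}
    for m, n in combinations(points, 2):
        if d(m, n) <= delta:
            adj[m].add(n)
            adj[n].add(m)
    return adj


# Hypothetical toy data: four points on the real line.
pos = {"a": 0.0, "b": 1.0, "c": 3.0, "d": 7.0}
dist = lambda m, n: abs(pos[m] - pos[n])

g2 = threshold_graph(list(pos), dist, 2.0)  # 'd' has no neighbour
g4 = threshold_graph(list(pos), dist, 4.0)  # every point has a neighbour
```

With $\delta=2$ the point `d` stays isolated, while $\delta=4$ connects it to `c`. This is the kind of "noise" that the choice of threshold discussed in the text is meant to avoid.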

Since we introduced a possibility to derive a graph that just depends on the metric space, we use a slight modification of the definition of prominence compared to (Schmidt and Stumme, 2018) for networks.

Definition 3.3 (Prominence in Networks).

Let $G=(V,E)$ be a graph and let $h:V\to\mathbb{R}_{\geq 0}$ be a height function. The prominence $\operatorname{prom}_{G}(v)$ of $v\in V$ is defined by

$$\operatorname{prom}_{G}(v)\coloneqq\min\{h(v),\operatorname{mindesc}_{G}(v)\},$$

where $\operatorname{mindesc}_{G}(v)\coloneqq\inf\{\max\{h(v)-h(u)\mid u\in p\}\mid p\in P_{v}\}$ . The set $P_{v}$ contains all paths to vertices $w$ with $h(w)\geq h(v)$ , i.e., $P_{v}\coloneqq\{\{v_{i}\}_{i=0}^{n}\in P\mid v_{0}=v\wedge v_{n}\neq v\wedge h(v_{n})\geq h(v)\}$ , where $P$ denotes the set of all paths of the graph $G$ .

Informally, $\operatorname{mindesc}_{G}(v)$ reflects the minimal descent needed to get to a vertex in $G$ which has a height of at least $h(v)$. For this the definition makes use of the fact that $\inf\emptyset=\infty$ in cases where no such point exists. This case results in $\operatorname{prom}_{G}(v)$ being the height of $v$. An essential distinction from the prior definition in (Schmidt and Stumme, 2018) is that we now consider all paths and not just shortest paths. Based on this we are able to transfer the notions above to metric spaces.
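Since $\max\{h(v)-h(u)\mid u\in p\}$ equals $h(v)$ minus the lowest height on the path $p$, the minimal descent can be found by searching for a path whose lowest point is as high as possible. The sketch below uses a Dijkstra-style "widest path" search for this. It is one possible way to evaluate the definition on a finite graph, not necessarily the procedure used by the authors; the function name `prominence` and the toy graph are assumptions.

```python
import heapq
import math


def prominence(adj, h, v):
    """min(h(v), minimal descent from v to another vertex of height >= h(v))."""
    # best[u]: highest achievable lowest point on a path from v to u
    best = {v: h[v]}
    heap = [(-h[v], v)]
    done = set()
    while heap:
        neg_low, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        low = -neg_low
        if u != v and h[u] >= h[v]:
            # vertices are finalized in order of decreasing lowest point,
            # so the first admissible target realizes the minimal descent
            return min(h[v], h[v] - low)
        for w in adj[u]:
            cand = min(low, h[w])
            if w not in done and cand > best.get(w, -math.inf):
                best[w] = cand
                heapq.heappush(heap, (-cand, w))
    return h[v]  # no admissible target: inf of the empty set, prominence is h(v)


# Hypothetical toy graph: the path a - b - c - d with made-up heights.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
h = {"a": 5, "b": 2, "c": 4, "d": 1}

prom = {v: prominence(adj, h, v) for v in adj}
print(prom)  # {'a': 5, 'b': 0, 'c': 2, 'd': 0}
```

In the toy graph, `c` has to descend to height 2 (through `b`) to reach the higher vertex `a`, so its prominence is 2. Vertices that are not local maxima, such as `b` and `d`, have prominence 0.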

Definition 3.4 ($\delta$-Prominence in Metric Spaces).

Let $(M,d)$ be a bounded metric space and $h:M\to\mathbb{R}_{\geq 0}$ be a height function. We define the $\delta$-prominence $\operatorname{prom}_{\delta}(m)$ of $m\in M$ as $\operatorname{prom}_{G_{\delta}}(m)$, i.e., the prominence of $m$ in the step graph $G_{\delta}$ from Definition 3.2.

We now have a prominence term for all metric spaces that depends on a parameter $\delta$ to choose. For all knowledge procedures, choosing such a parameter is a demanding task. Hence, we want to provide in the following a natural choice for $\delta$. The idea is informally the following: We consider only those values for $\delta$ such that the corresponding $G_{\delta}$ does not exhibit noise, i.e., there is no element without a neighbor. In other words, we allow only those values of $\delta$ such that $\forall m\in M\ \exists e\in E_{\delta}:m\in e$.

Definition 3.5 (Minimal Threshold).

For $(M,d)$ a bounded metric space with $|M|>1$ we define the minimal threshold $\delta_{M}$ of $M$ as

$$\delta_{M}\coloneqq\sup\{\inf\{d(m,n)\mid n\in M\setminus\{m\}\}\mid m\in M\}.$$

Based on this definition a natural notion of prominence for metric spaces (equipped with a height function) emerges via a limit process.

Lemma 3.6.

Let $M$ be a bounded metric space and $\delta_{M}$ as in Definition 3.5. For $m\in M$ the descending limit

$$\lim_{\delta\searrow\delta_{M}}\operatorname{prom}_{\delta}(m)$$

exists.

Proof.

Fix any $\hat{\delta}>\delta_{M}$ and consider on the open interval from $\delta_{M}$ to $\hat{\delta}$ the function that maps $\delta$ to $\operatorname{prom}_{\delta}(m)$:

$$\operatorname{prom}_{(.)}(m):\ ]\delta_{M},\hat{\delta}[\ \to\mathbb{R},\quad\delta\mapsto\operatorname{prom}_{\delta}(m).$$

It is well known that it is sufficient to show that $\operatorname{prom}_{(.)}(m)$ is monotone decreasing and bounded from above. Since $\operatorname{prom}_{\delta}(m)\leq h(m)$ holds for any $\delta$, it remains to show monotonicity. Let $\delta_{1},\delta_{2}$ be in $]\delta_{M},\hat{\delta}[$ with $\delta_{1}\leq\delta_{2}$. If we consider the corresponding graphs $(M,E_{\delta_{1}})$ and $(M,E_{\delta_{2}})$, it is easy to see that $E_{\delta_{1}}\subseteq E_{\delta_{2}}$. Hence, we have to consider more paths in the infimum of Definition 3.3 for $E_{\delta_{2}}$, resulting in a value for the infimum that is not larger. We obtain $\operatorname{prom}_{\delta_{1}}(m)\geq\operatorname{prom}_{\delta_{2}}(m)$, as required. ∎

This leads in a natural way directly to the following definition.

Definition 3.7 (Prominence in Metric Spaces).

If $M$ is a bounded metric space with $|M|>1$ and a height function $h$, the prominence $\operatorname{prom}(m)$ of $m$ is defined as:

$$\operatorname{prom}(m)\coloneqq\lim_{\delta\searrow\delta_{M}}\operatorname{prom}_{\delta}(m).$$

Note, if we want to compute prominence on a real world finite metric data set, it is possible to directly compute the prominence values: in that case the supremum in Definition 3.5 can be replaced by a maximum and the infimum by a minimum, which leads to $\operatorname{prom}(m)$ being equal to $\operatorname{prom}_{\delta_{M}}(m)$ . Hence, we can compute prominence and isolation for every point in the finite data set. There are results for efficiently creating such threshold graphs (Bentley, 1975). However, for our needs in this work, in particular in the experiment section, a quadratic brute force approach for generating all edges is sufficient. We want to show that our prominence definition for bounded metric spaces is a natural generalization of Definition 3.3.
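In the finite case described above, the minimal threshold is simply the largest nearest-neighbour distance. The following sketch illustrates this with a quadratic brute-force computation; the names `minimal_threshold` and `has_isolated_point` and the toy data are assumptions made for illustration.

```python
def minimal_threshold(points, d):
    """delta_M of a finite metric space: the largest nearest-neighbour distance."""
    return max(min(d(m, n) for n in points if n != m) for m in points)


def has_isolated_point(points, d, delta):
    """True if some point has no neighbour in the delta-step graph."""
    return any(all(d(m, n) > delta for n in points if n != m) for m in points)


# Hypothetical toy data: four points on the real line.
pos = {"a": 0.0, "b": 1.0, "c": 3.0, "d": 7.0}
dist = lambda m, n: abs(pos[m] - pos[n])

delta_m = minimal_threshold(list(pos), dist)
print(delta_m)  # 4.0
```

Here $\delta_{M}=4$ is the distance from `d` to its nearest neighbour `c`. At this threshold no point is isolated, whereas any smaller threshold leaves `d` without a neighbour.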

Lemma 3.8.

Let $G=(V,E)$ be a finite, connected graph with $\left|V\right|\geq 2$ . Consider $V$ equipped with the shortest path metric as a metric space. Then the prominence $\operatorname{prom}_{G}(\cdot)$ from Definition 3.3 and $\operatorname{prom}(\cdot)$ from Definition 3.7 coincide.

Proof.

Let $M\coloneqq V$ be equipped with the shortest path metric $d$ on $G$ . As $G$ is connected and has more than one node, we have $\delta_{M}=1$ . This yields that $(M,E_{\delta_{M}})$ from Definition 3.2 and $G$ are equal. Hence, the prominence terms coincide. ∎
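The argument of the proof can be checked numerically on a small example. The sketch below computes the shortest path metric of a hypothetical connected graph by breadth-first search and confirms that $\delta_{M}=1$ and that the resulting threshold graph has exactly the original edges. The graph and all names are illustrative assumptions.

```python
from collections import deque
from itertools import combinations


def shortest_path_metric(adj):
    """All-pairs hop distances of a connected, unweighted graph via BFS."""
    dist = {}
    for s in adj:
        dist[s, s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if (s, w) not in dist:
                    dist[s, w] = dist[s, u] + 1
                    queue.append(w)
    return dist


# Hypothetical connected graph on five vertices.
adj = {1: {2}, 2: {1, 3, 4}, 3: {2, 4}, 4: {2, 3, 5}, 5: {4}}
dist = shortest_path_metric(adj)
nodes = sorted(adj)

# minimal threshold of (V, shortest path metric)
delta_m = max(min(dist[m, n] for n in nodes if n != m) for m in nodes)

edges_g = {frozenset((u, w)) for u in adj for w in adj[u]}
edges_delta = {
    frozenset((m, n)) for m, n in combinations(nodes, 2) if dist[m, n] <= delta_m
}
```

Since both edge sets agree, any prominence computation on the threshold graph returns the same values as on the original graph.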

4. Application

4.1. Score based item recommending

As an application of our valuation functions, we envisage a general approach for a score based item recommending process. The task of item recommending in knowledge graphs is a current research topic. However, most approaches are solely based on knowledge about preferences of the given user and graph structural properties, often accessed through knowledge graph embeddings. The idea of the recommendation process we imagine differs from those. We propose a procedure that is based on the information entailed in the connection of the metric aspects of the data together with some (often present) height function. Of course, we are aware that this limits our approach to metric data in knowledge graphs. Nonetheless, given the large amounts of metric item sets in prominent knowledge graphs, we claim the existence of a plenitude of applications. For example, while considering sets of cities, such a system could recommend a relevant subset, based on a height function, like population, and a metric, like geographical distances. By doing so, we introduce a source of information for recommending metric data in relational structures, like knowledge graphs.

A common approach for analyzing and learning in knowledge graphs is knowledge graph embedding. There is an extensive amount of research about that, see for example (Wang et al., 2014; Bordes et al., 2011). Since our novel methods rely solely on bounded metric spaces and some valuation function, one may apply those after the embedding step as well. In particular, one may use isolation and prominence for investigating or completing knowledge graph embeddings. This constitutes our second envisioned application.

Finally, common item recommending scores/ranks can also be used as height functions in our sense. Hence, computing prominence and isolation for already established recommendation systems is another possibility. Here, our valuation functions have the potential to enrich the recommendation process with additional information. In such a way our measures can provide a novel additional aspect to existing approaches.

The realization and evaluation of our proposed recommendation approach is out of scope of this paper. Nonetheless, we want to provide some first insights for the applicability of valuation functions for item sets based on empirical experiments. As a first experiment, we will evaluate if isolation and prominence help to separate important and unimportant items in specific item sets in Wikidata. More specifically, we will evaluate if the valuation functions help to differentiate important and unimportant municipalities in the countries of France and Germany, solely based on their geographic metric properties and their population as height function.

4.2. Enriching metric item sets in Wikidata

In this section we depict a universal approach for enriching finite metric item sets in Wikidata using the introduced functions isolation and prominence. To make the approach easier to follow, we accompany every step with a running example: To what extent do municipalities stand out with respect to their local surroundings, based on population (height)? The particular steps are as follows:

(1) Identify a metric item set in the knowledge graph: For this we need to identify the metric space of all items in some considered set. One may pre-compute their pairwise distances, if applicable.

For our experiments we identify the set of German municipalities and French municipalities with their geographic coordinates in longitude and latitude and also compute their pairwise (approximated) distances.

(2) Identify height function: Since we want to compute the prominence and isolation of the items, we also have to identify a height function. Hence, we need to identify a valued property shared by all items identified in the step above which is also relevant to the enriching task.

In our running example we identify the population of the municipalities as such a relevant shared valued property.

(3) Compute isolation: Based on the steps before we are now able to compute the isolation for all items in the item sets.

For our running example, we compute the isolation for all municipalities for the item sets of Germany and France.

(4) Compute the threshold graph: For computing the prominence values for all items in the item sets, we need to compute the threshold graph and the threshold $\delta_{M}$ using Definition 3.2 and Definition 3.5.

In our running example, for Germany we compute the value $\delta_{M}\approx 32$ kilometers. This value is necessary in order to preserve a connection between Borkum (Q25082) and Krummhörn (Q559432). For the French item set we compute $\delta_{M}\approx 54$ kilometers in order to preserve the connection between Mende (Q191772) and La Grand-Combe (Q239967).

(5) Compute prominences: Equipped with the threshold graph we are now able to compute the prominence values for all items using Definition 3.4.
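The five steps above can be sketched end to end for a small item set. The code below is an illustrative sketch under several assumptions: the haversine formula stands in for the "approximated" geographic distance (the paper does not state which approximation was used), the four places with their coordinates and populations are made up, and the names `haversine_km`, `enrich` and `_prominence` are hypothetical.

```python
import heapq
import math
from itertools import combinations


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def _prominence(adj, h, v):
    """Prominence of v in a graph via a widest-path search (highest lowest point)."""
    best, heap, done = {v: h[v]}, [(-h[v], v)], set()
    while heap:
        neg_low, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u != v and h[u] >= h[v]:
            return min(h[v], h[v] + neg_low)  # h(v) minus lowest point on the path
        for w in adj[u]:
            cand = min(-neg_low, h[w])
            if w not in done and cand > best.get(w, -math.inf):
                best[w] = cand
                heapq.heappush(heap, (-cand, w))
    return h[v]  # no vertex of at least equal height is reachable


def enrich(coords, height):
    """Return (isolation, delta_M, prominence) for a finite item set."""
    items = list(coords)
    # Steps (1)-(2): pairwise distances; `height` plays the role of h
    dist = {}
    for m, n in combinations(items, 2):
        dist[m, n] = dist[n, m] = haversine_km(coords[m], coords[n])
    # Step (3): isolation
    iso = {}
    for m in items:
        higher = [dist[m, n] for n in items if n != m and height[n] >= height[m]]
        iso[m] = min(higher) if higher else max(dist[m, n] for n in items if n != m)
    # Step (4): minimal threshold and threshold graph
    delta_m = max(min(dist[m, n] for n in items if n != m) for m in items)
    adj = {m: [n for n in items if n != m and dist[m, n] <= delta_m] for m in items}
    # Step (5): prominence in the threshold graph
    prom = {m: _prominence(adj, height, m) for m in items}
    return iso, delta_m, prom


# Hypothetical places: (latitude, longitude) and population.
coords = {"A": (50.0, 8.0), "B": (50.0, 8.2), "C": (50.0, 9.0), "D": (50.0, 9.1)}
population = {"A": 100000, "B": 20000, "C": 60000, "D": 10000}

iso, delta_m, prom = enrich(coords, population)
```

In this toy example the threshold graph splits into the two components {A, B} and {C, D}. As a consequence, C keeps its full population as prominence although the larger place A exists, because A is not reachable from C at the minimal threshold.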

4.3. Resulting Questions

The sections above raise the natural question for an objective evaluation of the functions prominence and isolation. In this section we present such an evaluation scheme by means of two qualitative questions connected to this task.

Assume we are given a bounded metric space $M$ representing our data set and a height function $h$. The aim of the research questions we propose in the following is to evaluate whether our functions isolation and prominence provide useful information about the relevance of given points in the metric space. If $(M,d,h)$ is a metric space equipped with an additional height function, let the map $c:M\to\{0,1\}$ be a binary function that classifies the points in the data set as relevant (1) or not (0). We want to answer the following questions to evaluate whether there is a connection between the extent to which a data point is locally outstanding (i.e., has high isolation and prominence) and relevance. We connect this to our running example using the classification function that classifies municipalities as having a university (1) or not having a university (0). We admit that the underlying classification is not meaningful in itself. However, since this setup is essentially a benchmark framework (in which we assume cities with universities to be more relevant) we refrain from employing a more meaningful classification task in favor of a controllable classification scenario.

(1) Are prominence and isolation alone characteristic for relevance?

We use isolation and/or prominence for a given set of data points as features. To what extent do these features improve learning a classification function for relevance?

This question manifests in our running example as follows: are prominence and isolation useful features to classify the university locations of France and Germany?

(2) Do prominence and isolation provide additional information, not captured by the absolute height?

Do prominence and isolation improve the prediction performance of relevance compared to just using the absolute height? Does a classifier that uses prominence and isolation as additional features produce better results than a classifier that just uses the absolute height?

In the context of our running example: Do prominence and isolation of municipalities add information to the population feature that helps to characterize the university locations, compared to using the plain population value?

We will evaluate the proposed setup in the realm of a knowledge graph, take on the questions stated above in the following section, and present some experimental evidence.

5. Experiments

5.1. Dataset

We extract information about municipalities in the countries of Germany and France from the Wikidata knowledge graph. This knowledge graph is a structure that stores knowledge via statements, linking entities via properties to values. A detailed description can be found in (Vrandečić and Krötzsch, 2014), while (Hanika et al., 2019) gives an explicit mathematical structure to the Wikidata graph and shows how to use the graph for extracting implicational knowledge from Wikidata subsets. We investigate in the following whether prominence and isolation of a given municipality can be used as features to predict university locations in a classification setup. We use the query service of Wikidata (https://query.wikidata.org/) to extract points in the country maps of Germany and France and to extract all their universities. For every relevant municipality we extract the coordinates and the population. The necessary SPARQL queries we employed for all the following tasks are documented in the GitHub repository (https://github.com/mstubbemann/Orometric-Methods-in-Bounded-Metric-Data) for our paper project. While constructing the needed metric space, we have to overcome some obstacles.

•

Wikidata provides different relations for extracting items that are instances of the notion city. The most obvious choice is to employ the instance of (P31) property for the item city (Q515). Using this, including subclass of (P279), we find insufficient results for generating our data sets. More specific, we find only 102 French cities and 2215 German cities.555Queried on 07-08-19 For Germany, there exists a more commonly used item urban municipality of Germany (Q42744322) for extracting all cities, while to the best of our knowledge, a counterpart for France is not provided.

•

The preliminary investigation led us to use not cities but municipality (Q15284), again including the subclass of (P279) property, with more than 5000 inhabitants.

•

Since there are multiple french municipalities that are not located in the mainland of France, we encounter problems for constructing the metric space. To cope with that we draw a basic approximating square around the mainland of France and consider only those municipalities inside.

• We find the class of every municipality, i.e., university location or non-university location, through the following approach. We use the properties located in the administrative territorial entity (P131) and headquarters location (P159) on the set of all universities and check whether these are set in Germany or France. An example of a German university that has not set P131 is TU Dortmund (Q685557) (last checked on 19-06-25).

• Using a Python script, we then matched the list of municipalities with the indicated properties of the universities. This method was necessary because some universities are not related to municipalities through property P131. For example, the item Hochschule Niederrhein (Q1318081) is located in the administrative location North Rhine-Westphalia (Q1198), which is a federal state containing multiple municipalities. For these cases we checked the university locations manually. Some basic statistics on our dataset can be found in Table 1; a graphic overview of the municipality and university distribution is depicted in Figure 1.

• During the construction of the data set we encounter universities that are associated with a country but have neither located in the administrative territorial entity (P131) nor headquarters location (P159) set. This is the case for ten German and twelve French universities. We checked them manually and were able to discard all of them for different reasons, for example, items that were wrongly related to the university item.
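The construction steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the queries or the script used for the paper (those are documented in the repository mentioned above). The identifiers P31, P279 and Q15284 come from the text; the use of country (P17), coordinate location (P625) and population (P1082), the helper names, and the coordinates of `EXAMPLE_BOX` are assumptions made for this example only.

```python
# Illustrative sketch only; names and box coordinates are hypothetical.

QUERY_TEMPLATE = """
SELECT DISTINCT ?item ?coord ?population WHERE {{
  ?item wdt:P31/wdt:P279* wd:Q15284 ;  # instance of (a subclass of) municipality
        wdt:P17 wd:{country} ;         # country (assumed property choice)
        wdt:P625 ?coord ;              # coordinate location (assumed)
        wdt:P1082 ?population .        # population (assumed)
  FILTER(?population > {min_population})
}}
"""

# Placeholder box as (south, west, north, east) in degrees.
# These are NOT the values used by the authors.
EXAMPLE_BOX = (42.0, -5.5, 51.5, 8.5)


def build_municipality_query(country_qid, min_population=5000):
    """Return a SPARQL query string for municipalities of one country."""
    return QUERY_TEMPLATE.format(country=country_qid,
                                 min_population=min_population)


def inside_box(lat, lon, box):
    """Check whether a point lies inside an axis-aligned bounding box."""
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def label_university_locations(municipality_ids, university_location_ids):
    """Map each municipality id to True if some university is located there."""
    locations = set(university_location_ids)
    return {m: (m in locations) for m in municipality_ids}
```

A call such as `build_municipality_query("Q183")` yields a query string for Germany that could be sent to the query service. Universities whose location resolves to a federal state rather than a municipality would remain unlabeled by `label_university_locations` and need the manual check described above.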

5.2. Binary Classification Task

Setup

For both France and Germany, we compute the prominence and isolation of all data points. We then normalize the population, isolation and prominence values to lie in the range from $0$ to $1$. Since our data set is highly imbalanced, most common classifiers would tend to simply predict the majority class. A variety of methods have been proposed to deal with such problems; an overview can be found in (Kotsiantis et al., 2006). Sampling approaches like undersampling or oversampling via the creation of synthetic examples (Chawla et al., 2002) are an established method for dealing with such imbalances. We stress again that the goal of the classification task introduced here is not to identify the best classifier. Rather, we want to produce evidence for the applicability of isolation and prominence as (more suitable) features for learning a classification function. Since we need a classification algorithm that provides useful predictions on single features, we decide to use logistic regression with $L^{2}$ regularization and Support Vector Machines (Cortes and Vapnik, 1995) with a radial kernel. To overcome the imbalance, we use inverse penalty weights with respect to the class distribution.

For our experiments, we use the algorithms for Support Vector Machines (SVC) and LogisticRegression that are provided by the Python library Scikit-Learn (Pedregosa et al., 2011). To solve the resulting minimization problem, our setup of Scikit-Learn uses the LIBLINEAR library, see (Fan et al., 2008). As penalty factor for the SVC we set $C=1$; however, we also experiment with $C\in\{0.5,1,2,5,10,100\}$. As in (Akbani et al., 2004), where the authors compared multiple methods for using support vector machines on imbalanced data sets, we choose $\gamma=1$ for our radial kernel. For all possible combinations of population, isolation and prominence we use one hundred iterations of five-fold cross-validation. We analyze to which extent the novel valuation functions help to classify university municipalities in Germany and France.

Evaluation

We use the g-mean (i.e., geometric mean) as evaluation function. Consider the confusion matrix depicted in Table 2.

Overall accuracy (i.e., how many test examples are classified correctly) is highly misleading in the context of heavily imbalanced data. Any classifier function that simply predicts the majority class would achieve an excellent accuracy (Chawla, 2010). Therefore, we evaluate the classification decisions by using the geometric mean of the accuracy on the positive instances, $acc_{+}:=\frac{TP}{TP+FN}$, often called sensitivity, and the accuracy on the negative instances, $acc_{-}:=\frac{TN}{TN+FP}$, often called specificity. The g-mean score is then defined by the formula $g_{mean}:=\sqrt{acc_{+}\cdot acc_{-}}$. The evaluation function g-mean is established in the field of imbalanced data mining. It is mentioned in (He and Garcia, 2009) and used for evaluation in (Akbani et al., 2004).
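As a worked illustration, consider a hypothetical classifier that labels every French municipality as a non-university location. Taking the counts from Table 1 and assuming that the 92 university locations are counted among the 2063 municipalities, we obtain $TP=0$, $FN=92$, $TN=1971$ and $FP=0$, and therefore

$$\frac{TP+TN}{TP+TN+FP+FN}=\frac{1971}{2063}\approx 0.955,\qquad acc_{+}=\frac{0}{0+92}=0,\qquad acc_{-}=\frac{1971}{1971+0}=1,\qquad g_{mean}=\sqrt{0\cdot 1}=0.$$

The accuracy looks excellent although no university location is found, whereas the g-mean exposes the failure on the positive class.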

In our setup, the university locations are the positive class, meaning that $acc_{+}$ corresponds to the accuracy on the university locations, and $acc_{-}$ corresponds to the accuracy on non-university locations. For our experiments we compare the g-mean values for the following cases. First, we train a classifier function purely on one of the features population, prominence or isolation. Second, we also try combinations of them for the training process. In all those experiments, we consider the classifier trained solely on the population feature as the baseline, since this classification function does not incorporate any metric aspects of the data set. An increase in g-mean when using prominence or isolation together with the population feature is then evidence for the utility of the introduced valuation functions. Furthermore, when directly comparing a classifier function that is trained on isolation/prominence with a version trained on population, an increase in g-mean strongly indicates the importance of the novel features.

In our experiments, we are not expecting high values for g-mean, since the placement of university locations depends on many additional features, including historical evolution of the country and political decisions. However, we claim that the evaluation setup above is sufficient to show that the novel features are potentially helpful for identifying interesting and useful items in different tasks.
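The setup and evaluation described above can be sketched with Scikit-Learn as follows. This is a minimal sketch under stated assumptions, not the authors' experiment code: the function names, the random seed, the use of stratified folds, and the choice of `class_weight="balanced"` as the realization of inverse class-frequency weights are assumptions made for illustration. The columns of `X` are assumed to be population, isolation and prominence, and `y` is 1 for university locations.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

FEATURES = ("population", "isolation", "prominence")


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (acc_+) and specificity (acc_-)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    acc_pos = tp / (tp + fn) if tp + fn else 0.0
    acc_neg = tn / (tn + fp) if tn + fp else 0.0
    return float(np.sqrt(acc_pos * acc_neg))


def min_max_scale(X):
    """Scale every column to [0, 1]; constant columns are mapped to 0."""
    X = np.asarray(X, dtype=float)
    low, high = X.min(axis=0), X.max(axis=0)
    span = np.where(high > low, high - low, 1.0)
    return (X - low) / span


def evaluate_feature_sets(X, y, n_repeats=100, seed=0):
    """Mean g-mean per (classifier, feature subset) over repeated 5-fold CV."""
    X = min_max_scale(X)
    # Stratified folds are an assumption; the text only states five folds.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats,
                                 random_state=seed)
    scorer = make_scorer(g_mean)
    results = {}
    for size in range(1, len(FEATURES) + 1):
        for cols in combinations(range(len(FEATURES)), size):
            names = tuple(FEATURES[c] for c in cols)
            models = {
                # L2 is the default penalty of LogisticRegression.
                "logreg": LogisticRegression(solver="liblinear",
                                             class_weight="balanced"),
                "svc": SVC(kernel="rbf", C=1.0, gamma=1.0,
                           class_weight="balanced"),
            }
            for model_name, model in models.items():
                scores = cross_val_score(model, X[:, list(cols)], y,
                                         cv=cv, scoring=scorer)
                results[(model_name, names)] = float(scores.mean())
    return results
```

Comparing `results[("svc", ("population",))]` with the entries for the other feature subsets corresponds to the baseline comparison described above. The penalty sweep over $C$ could be added by passing the value of `C` as a further parameter.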

Results

The results of our evaluation can be found in Table 3. In the following we collect the observations drawn from this table.

• **Isolation is a good indicator for structural relevance.** Considering the results for both countries, we notice that using isolation as the only feature leads to a solid prediction of university and non-university locations. For both countries and classifiers, it outperforms population.

• **Combining absolute height with our valuation functions leads to better results.** Combining our orometric functions with population leads to better performance than using the population feature alone.

• **Prominence is not useful as a solo indicator.** Our results raise confidence that prominence alone is not a useful indicator for finding university locations. We propose the following explanation. Prominence is a very strict valuation function: recall that we constructed the graphs by using distance margins as indicators for edges, leading to a dense graph structure in the denser parts of the metric space. It follows that a point in a denser part has many neighbors and thus many potential paths that may lead to a very low prominence value. From Definition 3.3, one can see that having a higher neighbor, with respect to the height function, always leads to a prominence value of zero. As mentioned earlier, the threshold is about 32 kilometers for Germany and 54 kilometers for France. Hence, a municipality has a non-vanishing prominence only if it is the most populated point within a radius of over 32 kilometers, respectively 54 kilometers. Only 75 municipalities of France have non-zero prominence, with 41 of them being university locations. Germany has 124 municipalities with positive prominence, with 78 of them being university locations. Thus, prominence alone is insufficient as a feature for the prediction of university locations. As indicated in Table 3, the low g-mean score results from bad accuracy on the positive instances. Overall, it is a useful feature for identifying outstanding “peaks”.

• **The results for Germany differ from the results for France.** The margin by which isolation outperforms population as the sole feature is greater for Germany than for France. The same holds for the score improvement if we add prominence and isolation as features to population. We assume that this observation is based on the difference in the geographic population distribution in France and in Germany: having another look at Figure 1, one may observe a tendency of university locations to cluster in some French areas. For example, in the area around Paris, one may observe a variety of universities located in the regional surroundings. The corresponding municipalities are all dominated by the nearby city of Paris. As a consequence, they have low isolation and prominence values.

• **Support vector machine and logistic regression lead to similar results.** On the question of whether our valuation functions improve the classification compared with the population feature, support vector machines and logistic regression provide the same answer: isolation always outperforms population, and a combination of all features is always better than using just the plain population feature.

• **Support vector machine penalty parameter.** Finally, for our last test we compare the results for support vector machines using the penalty parameters $C\in\{0.5,1,2,5,10,100\}$. We observe that increasing the penalty results in better performance when using the population feature. However, for lower values of $C$, i.e., models that overfit less, we see better performance when using the isolation feature. In short, the more the model overfits due to $C$, the less useful are the novel valuation functions we introduced in this paper.
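The explanation given for prominence in the list above rests on a simple necessary condition: a point with a higher neighbor within the distance threshold has prominence zero. The following sketch checks only this condition; it does not compute prominence itself. The function names, the use of the haversine distance with a mean Earth radius of 6371 km, and the inclusive comparison with the threshold are assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def can_have_positive_prominence(points, threshold_km):
    """Flag points that have no higher neighbor within the threshold.

    points: list of (lat, lon, height), with height being the population.
    A False entry means the prominence is zero; a True entry only means
    that the necessary condition for positive prominence is satisfied.
    """
    flags = []
    for i, (lat, lon, height) in enumerate(points):
        dominated = any(
            other_height > height
            and haversine_km((lat, lon), (o_lat, o_lon)) <= threshold_km
            for j, (o_lat, o_lon, other_height) in enumerate(points)
            if j != i
        )
        flags.append(not dominated)
    return flags
```

With a threshold of 32 km, a small town roughly 11 km away from a larger city is flagged `False`, which mirrors the situation of the municipalities surrounding Paris described above.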

6. Conclusion and Outlook

In this work, we presented a novel approach to identify outstanding elements in item sets. For this we employed orometric valuation functions, namely prominence and isolation. We investigated a computationally reasonable transfer to the realm of bounded metric spaces. In particular, we generalized previously known results that were researched in the field of finite networks.

The theoretical work was motivated by the observation that knowledge graphs, like Wikidata, do contain huge amounts of metric data. These are often equipped with some kind of height functions in a natural way. Based on this we proposed in this work the groundwork for an item recommending scheme. This envisioned system would be capable of enriching conventional setups.

To evaluate the capabilities for identifying outstanding items we selected an artificial classification task. We identified all French and German municipalities from Wikidata and evaluated if a classifier can learn a meaningful connection between our valuation functions and the relevance of a municipality. To gain a binary classification task and to have a benchmark, we assumed that universities are primarily located at relevant municipalities. In consequence, we evaluated if a classifier can use prominence and isolation as features to predict university locations. Our results showed that isolation and prominence are indeed helpful for identifying relevant items.

For future work we propose to develop the conceptualized item recommender system and to investigate its practical usability in an empirical user study. Furthermore, we encourage research on the transferability of other orometric valuation functions. Finally, we acknowledge that our results about valuation functions in metric spaces are surely already present in mathematical theory. Identifying the related mathematical notions, and thereby benefiting from advanced mathematical results, would be the next theoretical goal.

Acknowledgements.

The authors would like to express thanks to Dominik Dürrschnabel for fruitful discussions. This work was funded by the German Federal Ministry of Education and Research (BMBF) in its program “Quantitative Wissenschaftsforschung” as part of the REGIO project under grant 01PU17012.

Bibliography

Akbani et al. (2004) Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz. 2004. Applying Support Vector Machines to Imbalanced Datasets. In Machine Learning: ECML 2004, Jean-François Boulicaut, Floriana Esposito, Fosca Giannotti, and Dino Pedreschi (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 39–50.
Barthélemy (2011) Marc Barthélemy. 2011. Spatial networks. Physics Reports 499, 1 (2011), 1–101. https://doi.org/10.1016/j.physrep.2010.11.002
Bentley (1975) Jon L Bentley. 1975. A Survey of Techniques for Fixed Radius Near Neighbor Searching. Technical Report. SLAC, SCIDOC, Stanford, CA, USA. http://slac.stanford.edu/pubs/slacreports/reports 09/slac-r-186.pdf SLAC-R-0186, SLAC-0186.
Bordes et al. (2011) Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning Structured Embeddings of Knowledge Bases. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2011, San Francisco, California, USA, August 7-11, 2011, Wolfram Burgard and Dan Roth (Eds.). AAAI Press, Palo Alto, California 94303, 301–306. http://www.aaai.org/ocs/index.php/AAAI/AAAI 11/paper/view/3659
Catherine and Cohen (2016) Rose Catherine and William Cohen. 2016. Personalized Recommendations Using Knowledge Graphs: A Probabilistic Logic Programming Approach. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys ’16). ACM, New York, NY, USA, 325–332. https://doi.org/10.1145/2959100.2959131
Chawla (2010) Nitesh V. Chawla. 2010. Data Mining for Imbalanced Datasets: An Overview. In Data Mining and Knowledge Discovery Handbook, Oded Maimon and Lior Rokach (Eds.). Springer, Heidelberg, 875–886. http://dblp.uni-trier.de/db/reference/dmkdh/dmkdh 2010.html#Chawla 10
Chawla et al. (2002) Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
