Geo-L: Linking Geospatial Data Made Easy

Christian Zinke-Wehlmann; Amit Kirschenbaum

arXiv:1906.05366·cs.DB·September 4, 2020


TL;DR

Geo-L is a system that simplifies linking and integrating geospatial Linked Data by efficiently discovering spatial links based on topological relations, improving accuracy and retrieval performance.

Contribution

The paper introduces Geo-L, a novel system for discovering RDF spatial links using topological relations, enhancing existing spatial linking methods.

Findings

01 Improves mapping time and accuracy in spatial linking

02 Enhances resource retrieval efficiency and robustness

03 Outperforms state-of-the-art spatial linking processes

Abstract

Geospatial Linked Data is an emerging domain with growing interest in research and industry. There is an increasing number of publicly available geospatial Linked Data resources and they need to be interlinked and easily integrated with private and industrial Linked Data on the Web. The present paper introduces Geo-L, a system for discovery of RDF spatial links based on topological relations. Experiments show that the proposed system improves state-of-the-art spatial linking processes in terms of mapping-time and -accuracy, as well as concerning resources retrieval efficiency and robustness.

Tables

Table 1: Geometry features of components of the within formula and dimension of their intersections

Table 2: Comparison of properties of systems for geospatial link discovery

System	Scalability & Efficiency	Robustness	Interoperability & Flexibility
Silk	– long running time on large data-sets	– instances limited to size of 64K	+ standalone framework
		– not evaluated for relations cover and covered by	+ has REST and programmable APIs
			– linkage definition language is restricting
			– does not support transformation of geospatial data
AML	+ achieves best run time for touches and intersects for LineStrings	– reaches time limit for disjoint (75 min.)	+ uses ESRI, an external module for handling geometries
	– long running time on large data-sets for LineString/Polygon tasks for contains, within, and covers	– no information is given about error handling	– strict linkage definition
OntoIdea	– long running time on large datasets	– not evaluated for disjoint	– no specification given
	– not evaluated for large data-sets	– no information about error handling
Strabon	+ run time for intersects on smaller data-sets is better than that of LIMES	– did not finish any experiment on a large dataset within the time limit (2 hours)	+ implements GeoSPARQL, thus is able to transform geospatial objects at retrieval time
		– doesn’t provide feedback about progress of its task
		– no transparent error handling
LIMES	+ addresses all tasks regarding topological link discovery	– data or server error interrupt whole process	+ can be applied as part of a framework or as a part of an application via its API
	+ achieves the best run-time performance for most of the topological relations (except intersect, and touches)		– strict linkage definition (XML), no direct SPARQL support
Geo-L	+ addresses all tasks regarding topological link discovery	+ storing chunks of datasets regularly minimizes data loss if connection is interrupted due to e.g., server error	+ can be applied as an independent application or through its API (as well as via REST API)
	+ achieves the best run-time performance for all topological relations	+ provides feedback about task progress	+ supports dataset definition via SPARQL query

Equations

$$\text{DE+9IM}(a,b)=\begin{bmatrix}dim(I(a)\cap I(b))&dim(I(a)\cap B(b))&dim(I(a)\cap E(b))\\ dim(B(a)\cap I(b))&dim(B(a)\cap B(b))&dim(B(a)\cap E(b))\\ dim(E(a)\cap I(b))&dim(E(a)\cap B(b))&dim(E(a)\cap E(b))\end{bmatrix}$$

$$dim(S)=\begin{cases}-1&\text{if }S=\varnothing\\ 0&\text{if }S\text{ contains at least one point, but no lines or polygons}\\ 1&\text{if }S\text{ contains at least one line, but no polygons}\\ 2&\text{if }S\text{ contains at least one polygon}\end{cases}$$

$$a.within(b)=\begin{bmatrix}T&*&F\\ *&*&F\\ *&*&*\end{bmatrix}$$


Taxonomy

Topics: Data Management and Algorithms · Semantic Web and Ontologies · Geographic Information Systems Studies

Full text

Institut für Angewandte Informatik an der Universität Leipzig (InfAI)

Goerdelerring 9

04109 Leipzig, Germany

Geo-L: Linking Geospatial Data Made Easy

This work has been supported by the DataBio project, funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 732064.

Christian Zinke-Wehlmann

Amit Kirschenbaum


Keywords: Geospatial analysis · Linked Data · Semantic Web · Topological relations

Journal: Journal on Data Semantics

1 Introduction

The Web of Data, or Semantic Web, is a continuously growing global data space (see: https://www.w3.org/2013/data/). Semantic Web standards, such as RDF (Klyne and Carroll, 2004; RDF Working Group, 2014), OWL (Bechhofer et al., 2004; OWL Working Group, 2012), and SPARQL (Prud’hommeaux and Seaborne, 2008), were developed to express and exchange semantic information on the Web, and they tackle the challenge of interoperability (Hitzler et al., 2009). In the geospatial context, the most prominent is the GeoSPARQL initiative, which offers the vocabulary necessary to develop geo-related data on the Semantic Web (Battle and Kolas, 2011). In recent years, geospatial linked data has gained increasing attention (Nikolaou et al., 2015), also due to advances in the Earth Observation domain (Koubarakis et al., 2017). Thus, numerous resources of linked geospatial data have been developed, e.g., LinkedGeoData (Auer et al., 2009), Smart Point Of Interest (Čerba et al., 2016), Spanish Cases (de León et al., 2010), and Ireland’s national geospatial data (Debruyne et al., 2016); the domain is constantly growing within the Linked Data Cloud. Notably, the domain of geospatial data contains complex datasets, such as NUTS (Eurostat - European Commission, 2015), which describe territories using polygons that may be more than 1700 points long.

According to the Linked Data principles, published data should be interlinked with other datasets on the Web (Bizer et al., 2011). In general, linking (and fusing) of geospatial linked data sources enables large-scale inferences and data integration (Wiemann and Bernard, 2016). Nevertheless, explicit links are often not part of the dataset and should be discovered automatically, even in a distributed cloud environment and with huge datasets. These linking activities are one pillar for fostering the development of innovative software solutions. In particular, the linking of geospatial data is a challenging task, since the links express relations which depend on complex geometric computations.

The present work introduces Geo-L, a system for discovery of spatial links in RDF datasets according to topological relations. Geo-L was developed considering the following requirements, which we identified by comparing existing approaches, services, and tools for this task:

1. Scalability and efficiency: As mentioned before, the Linked Data cloud is continually growing with new sources and data sets. The service should be able to handle big data sets. The idea is to provide a service for different Linked Data environments (open or closed). Therefore, the time required has to be reduced to a minimum. The vision is to discover links even in extensive data sets in near real time.

2. Robustness: The service must retain functionality under unforeseen conditions, such as missing or corrupted data. This is especially true for crowd-sourced or automatically generated data sets, which are likely to include errors as the size of data grows.

3. Interoperability and flexibility: The service has to be as easy and transparent to use as possible. The (SPARQL-affine) user should be able to easily formulate queries to retrieve source and target datasets, as well as the linking condition. This includes the ability to handle data whose representation is not directly suitable for computing, e.g., topological relations. Further, the service has to handle on-the-fly requests through RESTful input processing. It has to operate easily as a standalone system or as a module integrated into other applications.

2 Background

Linked (Open) Data refers to an area which focuses on the publishing of RDF (Resource Description Framework) data on the Web of Data. The Linked Data approach is closely tied to the Linked Data Principles by Tim Berners-Lee (Bizer et al., 2011). The basic idea of link discovery is to find data items within the target dataset which are logically connected to the source dataset. More formally, this means: Given $\mathscr{S}$ and $\mathscr{T}$ , sets of RDF resources, called source and target resources respectively, and a relation $R$ , the aim of link discovery methods is to find a mapping $M$ such that $M=\{(s,t)\in\mathscr{S}\times\mathscr{T}:R(s,t)\}$ . Naive computation of $M$ requires quadratic time to test for every $s\in\mathscr{S}$ and $t\in\mathscr{T}$ whether $R$ holds, which is infeasible for large data sets.
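The naive computation of $M$ can be sketched as follows. This is a minimal illustration, assuming resources are plain Python values and $R$ is a Boolean function; the name `naive_link_discovery` and the toy interval relation are hypothetical and not part of Geo-L.

```python
from itertools import product
from typing import Callable, Iterable, Set, Tuple, TypeVar

S = TypeVar("S")
T = TypeVar("T")


def naive_link_discovery(
    sources: Iterable[S],
    targets: Iterable[T],
    relation: Callable[[S, T], bool],
) -> Set[Tuple[S, T]]:
    """Return M = {(s, t) : R(s, t)} by testing every pair (|S| * |T| tests)."""
    return {(s, t) for s, t in product(sources, targets) if relation(s, t)}


def interval_within(a, b):
    """Toy 1-D stand-in for a topological relation: interval a lies within b."""
    return b[0] <= a[0] and a[1] <= b[1]


# Hypothetical toy data: closed intervals given as (start, end).
sources = [(1, 2), (4, 6), (8, 9)]
targets = [(0, 5), (3, 7)]
mapping = naive_link_discovery(sources, targets, interval_within)
```

Every source is compared with every target, so the number of relation tests grows with the product of the two set sizes; the indexing techniques discussed later aim to avoid most of these tests.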

In the geospatial context, $\mathscr{S}$ and $\mathscr{T}$ are sets of spatial objects, which contain geometries in a two-dimensional space as features; the links may be based on proximity or on topological relations. In the latter case, relations are expressed by the Dimensionally Extended nine-Intersection Model (DE+9IM) (Clementini et al., 1993, 1994), which was accepted as an ISO standard (ISO 19107:2003, E). DE+9IM classifies binary spatial relationships between two geometries, $a$ and $b$ , which may be points, lines, or polygons, based on the intersections of the interiors (I), boundaries (B), and exteriors (E) of $a$ with those of $b$ .

Combinations of these six geometric features define topological relations, which are described in a $3\times 3$ matrix as follows:

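$$\text{DE+9IM}(a,b)=\begin{bmatrix}dim(I(a)\cap I(b))&dim(I(a)\cap B(b))&dim(I(a)\cap E(b))\\ dim(B(a)\cap I(b))&dim(B(a)\cap B(b))&dim(B(a)\cap E(b))\\ dim(E(a)\cap I(b))&dim(E(a)\cap B(b))&dim(E(a)\cap E(b))\end{bmatrix}$$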

The intersection $S$ of some feature of $a$ with a feature of $b$ may be either empty or in itself a geometric object, namely: a point, a line, or a polygon. $dim(S)$ returns the dimension of the geometry $S$ ; if $S$ consists of multiple geometries, then $dim(S)$ is the maximal dimension among them.

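$$dim(S)=\begin{cases}-1&\text{if }S=\varnothing\\ 0&\text{if }S\text{ contains at least one point, but no lines or polygons}\\ 1&\text{if }S\text{ contains at least one line, but no polygons}\\ 2&\text{if }S\text{ contains at least one polygon}\end{cases}$$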

In addition to the dimension values, the matrix may contain the values T $(dim(S)\geq 0)$ , F $(dim(S)=-1)$ , and * (“don’t-care” value, which means that the value in this matrix cell has no influence on the outcome of a function applied to this matrix). The model defines topological predicates that describe the spatial relations between the two geometries in a compact and human-interpretable manner. These predicates are defined by pattern matrices: equals, disjoint, intersects, touches, crosses, overlaps, within, and contains. For example, the relation within is defined by the following pattern matrix (see also Strobl, 2008):

$$a.within(b)=\begin{bmatrix}T&*&F\\ *&*&F\\ *&*&*\end{bmatrix}$$

formally described as $(I(a)\cap I(b)\neq\varnothing)\wedge\neg(I(a)\cap E(b)\neq\varnothing)\wedge\neg(B(a)\cap E(b)\neq\varnothing)$ .

To illustrate how this matrix, and hence the formula, defines the within relation, consider Figure 1, which shows two geometries $a$ and $b$ , such that $a$ is within $b$ . We use Table 1 to graphically depict the respective features $f_{1}(a),f_{2}(b)$ , such that $f_{1},f_{2}\in\{I,B,E\}$ , used in each component of the within formula, for those two geometries, as well as the dimension of their intersection. As can be observed, the conditions of the topological relation within are satisfied.
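Matching a computed intersection matrix against a pattern matrix can be sketched in a few lines. This is an illustrative sketch, assuming the matrix is written row by row as a 9-character string with F standing for dimension -1; the function names are hypothetical, and the example dimensions describe an assumed case of a polygon $a$ lying strictly inside a polygon $b$ .

```python
def dim_symbol(dim: int) -> str:
    """Encode dim(S) as one character: -1 -> 'F', 0/1/2 -> '0'/'1'/'2'."""
    if dim not in (-1, 0, 1, 2):
        raise ValueError(f"invalid dimension: {dim}")
    return "F" if dim == -1 else str(dim)


def matches(matrix: str, pattern: str) -> bool:
    """Check a 9-character intersection matrix (row-major) against a pattern."""
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("matrix and pattern must have 9 entries")
    for value, expected in zip(matrix, pattern):
        if expected == "*":          # don't-care cell
            continue
        if expected == "T":          # any non-empty intersection
            if value == "F":
                return False
        elif value != expected:      # 'F' or an exact dimension
            return False
    return True


# Pattern for within: II = T, IE = F, BE = F, all other cells don't-care.
WITHIN = "T*F**F***"

# Assumed example: polygon a strictly inside polygon b, cells in row-major order
# (II, IB, IE, BI, BB, BE, EI, EB, EE).
dims_a_in_b = [2, -1, -1, 1, -1, -1, 2, 1, 2]
matrix_a_in_b = "".join(dim_symbol(d) for d in dims_a_in_b)

# Swapping the roles of a and b transposes the matrix.
matrix_b_in_a = "212FF1FF2"
```

Here `matches(matrix_a_in_b, WITHIN)` holds, while the transposed matrix fails on the third cell, because the interior of the larger polygon intersects the exterior of the smaller one.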

3 Related Work

Link discovery of topological relations among RDF data sets has received growing interest in recent years, and various methods for this problem have been proposed. These methods usually determine the topological relations between two geometries based on the relations computed between their minimum bounding boxes. A minimum bounding box (MBB) is the rectangle of minimum area that encloses all coordinates of a geometry and is commonly used as an approximation of the geometry to reduce the cost of computations that involve it (Freeman and Shapira, 1975).
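A minimal sketch of MBB computation and MBB-based filtering follows, assuming axis-aligned boxes over planar coordinates; the function names and the two example geometries are illustrative only.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def mbb(coords: Iterable[Point]) -> Box:
    """Axis-aligned minimum bounding box of a coordinate sequence."""
    xs, ys = zip(*coords)
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def box_within(a: Box, b: Box) -> bool:
    """True if box a lies entirely inside box b."""
    return b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]


# A triangle and a square that are disjoint, although their MBBs overlap.
triangle = [(0, 0), (4, 0), (0, 4)]
square = [(3, 3), (5, 3), (5, 5), (3, 5)]

candidate = boxes_intersect(mbb(triangle), mbb(square))
```

The example shows why the MBB test is only a filter: the boxes intersect, so the pair is a candidate, but the exact geometries are disjoint and an exact test on the geometries is still required.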

Smeros and Koubarakis (2016) use the MultiBlocking technique (Isele et al., 2011) to discover topological relations. This technique divides the earth surface into curved rectangles, and assigns each geometry to all blocks which it intersects, based on the geometry’s MBB. Relations discovered within each block are then aggregated to construct the links. This method is embedded in the Silk framework (Volz et al., 2009).

Radon (Sherif et al., 2017) divides the space into hyper-cubes and uses optimized sparse space tiling to index geometries. This is done by mapping each geometry to the set of hyper-cubes over which its minimum bounding box (MBB) spans. The method first indexes geometries $s\in\mathscr{S}$ and then only indexes geometries $t\in\mathscr{T}$ that may potentially reside in hyper-cubes already contained in the index. To minimize the size of the index, the method implements a swapping strategy, that is, prior to the indexing phase it calculates an estimated total hypervolume ( $eth$ ) for each of the datasets $\mathscr{S}$ and $\mathscr{T}$ . If $eth(\mathscr{T})<eth(\mathscr{S})$ then it swaps the two datasets and computes the reverse relation of the requested relation $R$ . The link generation itself is done using a method that reduces computations on a subset of DE+9IM relations. Radon is implemented as part of the LIMES framework (Ngomo and Auer, 2011; https://github.com/dice-group/LIMES/).
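The general idea of tiling-based candidate generation can be sketched as below. This is a simplified two-dimensional grid illustration under assumed names and an assumed fixed cell size; it is not Radon's optimized sparse space tiling and omits the swapping strategy.

```python
from collections import defaultdict
from math import floor


def cells(box, cell_size):
    """Grid cells (i, j) over which a box (min_x, min_y, max_x, max_y) spans."""
    min_x, min_y, max_x, max_y = box
    xs = range(floor(min_x / cell_size), floor(max_x / cell_size) + 1)
    ys = range(floor(min_y / cell_size), floor(max_y / cell_size) + 1)
    return {(i, j) for i in xs for j in ys}


def candidate_pairs(source_boxes, target_boxes, cell_size=1.0):
    """Index sources by cell, then look up targets only in occupied cells."""
    index = defaultdict(set)
    for s_id, box in source_boxes.items():
        for cell in cells(box, cell_size):
            index[cell].add(s_id)

    pairs = set()
    for t_id, box in target_boxes.items():
        for cell in cells(box, cell_size):
            if cell in index:  # membership test does not create new entries
                for s_id in index[cell]:
                    pairs.add((s_id, t_id))
    return pairs


# Hypothetical MBBs keyed by resource identifier.
source_boxes = {"s1": (0.2, 0.2, 0.8, 0.8), "s2": (5.1, 5.1, 5.9, 5.9)}
target_boxes = {"t1": (0.5, 0.5, 1.5, 1.5), "t2": (9.0, 9.0, 9.5, 9.5)}
pairs = candidate_pairs(source_boxes, target_boxes)
```

Only pairs sharing at least one cell are passed on to the exact relation test; targets falling entirely into cells without any source are never compared.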

Faria et al. (2017) adapt AgreementMakerLight (AML) (Faria et al., 2014), a framework for automated ontology matching, to tackle the task of topological relations. This is done by utilizing the ESRI Geometry API (https://github.com/Esri/geometry-api-java/), which uses a quadtree as a means to index geometries and detect topological relationships among them.

These methods, as well as OntoIdea (Khiat and Mackeprang, 2017), were evaluated on several sets of geometries: Achichi et al. (2017) apply them to discover topological relations between LineStrings constructed from trajectories in the TomTom dataset (https://www.tomtom.com). Saveta et al. (2018) apply these methods to find relations between LineStrings and LineStrings, and between LineStrings and Polygons, from the TomTom dataset and the Spaten dataset (Doudali et al., 2017), respectively. All datasets included at most 2000 instances. Both evaluations report that the methods mentioned above discover links correctly, that is, the $F$-score of most of them is $1.0$ (apart from OntoIdea, whose $F$-score lies between $0.91$ and $0.99$ , and which did not take part in the tasks for link discovery between LineStrings and Polygons).

Strabon (Kyzirakos et al., 2012) is an open-source geospatial RDF store. It is based on the RDF4J (previously Sesame) RDF store and adds geospatial capabilities to it by implementing the OGC standard GeoSPARQL; as part of the implementation, the geometries stored in Strabon are indexed with an R-Tree-over-GiST. Implementing GeoSPARQL means that Strabon includes topological functions; thus, queries that use these functions can be viewed as a means to discover topological relations. Sherif et al. (2017) compare the performance of Silk, Strabon, and Radon when they are applied to discover links between different subsets of the NUTS and CORINE Land Cover (https://www.eea.europa.eu/publications/COR0-landcover) datasets, which map land and land usage respectively. The biggest dataset used in their experiments is of size $2,209,538$ .

The evaluations compare the running times of these methods with different dataset sizes. It has already been acknowledged that a significant portion of big data is geospatial data (Lee and Kang, 2015; Li et al., 2016), thus our interest lies in the performance of these systems on large datasets. Table 2 summarizes how well the methods described above perform with regard to the criteria for useful geospatial link discovery systems discussed in Section 1, as reported in the literature (Sherif et al., 2017; Achichi et al., 2017; Saveta et al., 2018).

As can be observed in Table 2, the LIMES system, which implements Radon, was the one that completed the link discovery tasks for all topological relations and performed best for most of them. We therefore take LIMES as our main reference point. Nevertheless, LIMES as it is (we used version 1.5.5, the latest version available at the time of writing) is not sufficiently flexible to accommodate geospatial data of different formats, and requires external pre-processing of input. Additionally, LIMES assumes an error-free download and curated data sets, which is not always the case in reality. This motivates us to incorporate the advantages of existing techniques in a single solution and to test which existing technologies might be used for an efficient, flexible, robust, and interoperable system for on-the-fly semantic linking of geospatial data.

4 Geo-L

We developed a system for geospatial linking, which provides the required functionality and shows high performance and accuracy. Geo-L also offers flexible configuration options for the SPARQL-affine user as well as accurate error handling.

4.1 Input

The input for a link discovery task provides the resources to be linked and the conditions upon which the links are generated, in a simple, yet flexible manner. In particular, our method offers a way to retrieve relevant properties from the endpoint via a SPARQL query; thus it natively supports manipulation of data, without any need for external pre-processing. This is useful, for example, when geometry values at the endpoint are not represented in a format that directly allows computations of topological relations.

4.2 Download

Downloading from a SPARQL endpoint might occasionally be interrupted before the complete dataset has been delivered. To avoid a total loss of the data, our solution does not keep all the data in memory while downloading, but instead periodically writes smaller chunks to disk. In addition, the download might take a relatively long time due to the application implementation itself. Our solution seeks to improve this by reducing the application overhead when querying the remote endpoint.
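The chunk-wise download can be sketched as follows. This is an illustration under stated assumptions, not Geo-L's actual code: `fetch(offset, limit)` is a hypothetical stand-in for a paged SPARQL request, chunks are stored as JSON files, and the interruption is simulated.

```python
import json
import tempfile
from pathlib import Path


def download_in_chunks(fetch, total, chunk_size, out_dir):
    """Fetch `total` rows page by page and write each page to disk at once."""
    written = []
    for offset in range(0, total, chunk_size):
        limit = min(chunk_size, total - offset)
        rows = fetch(offset, limit)              # may raise on server errors
        path = out_dir / f"chunk_{offset:09d}.json"
        path.write_text(json.dumps(rows))        # persisted before next request
        written.append(path)
    return written


def flaky_fetch(offset, limit):
    """Simulated endpoint that drops the connection from offset 4 onwards."""
    if offset >= 4:
        raise ConnectionError("endpoint dropped the connection")
    return [f"row-{i}" for i in range(offset, offset + limit)]


out_dir = Path(tempfile.mkdtemp())
try:
    download_in_chunks(flaky_fetch, total=6, chunk_size=2, out_dir=out_dir)
except ConnectionError:
    pass  # chunks written before the failure remain on disk

saved = sorted(p.name for p in out_dir.iterdir())
```

After the simulated failure, the first two chunks are still available on disk, so a later run only needs to request the rows that are missing.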

4.3 Caching

To accelerate access to the source- and target-resources we incorporate a caching mechanism. Data retrieved from the SPARQL endpoint are stored in a central data store with an internal index. Further requests for data items from the same endpoint will be first served from the cache if the items are already indexed. This ensures a single local resource parallel to the endpoint, which serves arbitrarily many configurations, thus saves both time and storage. This differs from the behavior of LIMES, where data items may be downloaded multiple times, and duplicates of the data may be then stored. Algorithm 1 sketches the caching process. The method essentially compares the required triples range to the triple indices stored in an internal database, based on offset and limit parameters given in the configuration. It detects the indices of triples which are not already stored, retrieves the respective triples in chunks from the endpoint, and stores them in the database.
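The range comparison described above can be sketched as follows. This is a minimal sketch, not a reproduction of Algorithm 1: an in-memory dictionary stands in for the internal database, and `fetch(offset, limit)` is a hypothetical stand-in for a paged request to the endpoint.

```python
def missing_ranges(cache, offset, limit):
    """Contiguous (start, count) ranges of [offset, offset + limit) not cached."""
    ranges = []
    start = None
    for i in range(offset, offset + limit):
        if i not in cache:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i - start))
            start = None
    if start is not None:
        ranges.append((start, offset + limit - start))
    return ranges


def ensure_cached(cache, fetch, offset, limit, chunk_size):
    """Fetch only the missing triples, in chunks; return how many were fetched."""
    fetched = 0
    for start, count in missing_ranges(cache, offset, limit):
        for chunk_start in range(start, start + count, chunk_size):
            chunk_len = min(chunk_size, start + count - chunk_start)
            rows = fetch(chunk_start, chunk_len)
            for i, row in enumerate(rows):
                cache[chunk_start + i] = row
            fetched += len(rows)
    return fetched


def fake_fetch(offset, limit):
    """Simulated endpoint returning one placeholder triple per index."""
    return [f"triple-{i}" for i in range(offset, offset + limit)]


cache = {}
first = ensure_cached(cache, fake_fetch, offset=0, limit=165, chunk_size=50)
second = ensure_cached(cache, fake_fetch, offset=0, limit=400, chunk_size=50)
```

In this scaled-down scenario the second request, which overlaps the first, fetches only the 235 triples that are not yet stored, and the cache ends up holding each triple once.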

4.4 Link Discovery

The task of geo-link discovery requires efficient processing of spatial data, and therefore we use R-trees (Guttman, 1984) as our underlying data structure. An R-tree is a data structure used to store and query multi-dimensional objects in a way that preserves spatial relations, such as vicinity and nesting, among the indexed objects. An R-tree represents each object by its minimum bounding box (MBB), i.e., the smallest rectangle that encloses it, and a leaf node stores the MBB of that object and a pointer to the actual geometry. An R-tree is organized hierarchically; it groups MBBs by proximity and represents them by their MBB in a higher level of the tree. This process proceeds until all the MBBs are nested in a single bounding box, the tree root. R-Trees have been shown to be efficient in processing spatial joins to find topological relations between different data sets (Brinkhoff et al., 1993). R-Trees support both individual element search and range search, where all the items within a rectangle are retrieved.
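The hierarchy of nested MBBs and the range search can be illustrated with a small static tree. This sketch is a simplification under stated assumptions: the tree is bulk-loaded by sorting the boxes once, rather than built with Guttman's dynamic insertion and node splitting, and all names and example boxes are hypothetical.

```python
def union(boxes):
    """Smallest box (min_x, min_y, max_x, max_y) enclosing all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class Node:
    def __init__(self, box, children=None, item=None):
        self.box = box                    # MBB of everything below this node
        self.children = children or []    # empty for leaf entries
        self.item = item                  # identifier stored at a leaf entry


def build(items, capacity=4):
    """Bulk-load a tree from (item_id, box) pairs, grouping neighbours."""
    if capacity < 2 or not items:
        raise ValueError("need at least one item and capacity >= 2")
    ordered = sorted(items, key=lambda e: (e[1][0], e[1][1]))
    level = [Node(box, item=item_id) for item_id, box in ordered]
    while len(level) > 1:
        groups = [level[i:i + capacity] for i in range(0, len(level), capacity)]
        level = [Node(union([n.box for n in g]), children=g) for g in groups]
    return level[0]


def search(node, query):
    """Range search: items whose MBB intersects the query rectangle."""
    if not intersects(node.box, query):
        return []                         # prune the whole subtree
    if not node.children:
        return [node.item]
    found = []
    for child in node.children:
        found.extend(search(child, query))
    return found


items = [("a", (0, 0, 1, 1)), ("b", (2, 2, 3, 3)), ("c", (10, 10, 11, 11)),
         ("d", (0.5, 0.5, 2.5, 2.5)), ("e", (20, 0, 21, 1))]
root = build(items, capacity=2)
hits = sorted(search(root, (0.8, 0.8, 2.2, 2.2)))
```

Because each inner node stores the MBB of its subtree, a query rectangle that misses that MBB rules out all geometries below it without inspecting them.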

A practical problem occurs when the data contain errors, i.e., invalid geometries. The implications of using such data are wrong results, application performance issues, etc. For this reason, geometries are examined before indexing; invalid geometries are not indexed, and thus do not participate in the link discovery.
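The idea of examining geometries before indexing can be sketched with a deliberately simplified check. The assumptions here are loud: polygons are given as single coordinate rings, and the check covers only closure, a minimum number of points, finite coordinates, and non-zero area. Full validity in the OGC sense also excludes self-intersections, which spatial libraries and databases test (for example, PostGIS provides `ST_IsValid`); the sketch does not claim to reproduce Geo-L's validation.

```python
import math


def ring_is_plausible(ring):
    """Cheap sanity checks on a polygon ring given as a list of (x, y)."""
    if len(ring) < 4:
        return False                      # a closed ring needs >= 4 points
    if ring[0] != ring[-1]:
        return False                      # ring is not closed
    if not all(math.isfinite(c) for point in ring for c in point):
        return False                      # NaN or infinite coordinate
    # Shoelace formula: twice the signed area of the ring.
    area2 = sum(x1 * y2 - x2 * y1
                for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return area2 != 0


def split_valid(geometries):
    """Separate geometries to index from those to reject (and report)."""
    valid, rejected = {}, []
    for gid, ring in geometries.items():
        if ring_is_plausible(ring):
            valid[gid] = ring
        else:
            rejected.append(gid)
    return valid, rejected


# Hypothetical input: one usable triangle, one open ring, one degenerate ring.
geometries = {
    "ok": [(0, 0), (1, 0), (1, 1), (0, 0)],
    "open": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "flat": [(0, 0), (1, 1), (2, 2), (0, 0)],
}
valid, rejected = split_valid(geometries)
```

Only the entries in `valid` would be passed on to indexing; the identifiers in `rejected` can be reported to the user instead of silently distorting the link results.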

4.5 Implementation

We use Python as our preferred programming language, since it has become the language of choice for data science in general and provides useful tools for handling geospatial data in particular. We have experimented with the following technologies:

4.5.1 GeoPandas

Our initial implementation involved custom built caching and mapping mechanisms. We use Python’s GeoPandas library (Jordahl, 2016), which implements data structures for storing geometric types, as well as analysis tools for geospatial data. In particular, GeoPandas provides an interface for spatial joins, which allow combining observations stored in these data structures based on their spatial relations. For this purpose GeoPandas indexes geometries using R*-Tree (Beckmann et al., 1990), a variant of R-Tree that provides better search performance, at the cost of increased construction time. GeoPandas currently supports finding the following spatial relations: within, intersects, and contains.

We further experimented with Cython (Behnel et al., 2011), a language which is a superset of Python, where code can be compiled directly to C, generating efficient code. GeoPandas has been reimplemented in Cython in a way that optimizes the storage of geometries and should improve the performance of spatial operations.

4.5.2 PostgreSQL

Furthermore, we implement the system using PostgreSQL, an open source object-relational DBMS, with the PostGIS extension, which provides functionality to manage geospatial data, such as geometry data types, efficient indexing, and spatial joins, and is compliant with the Open Geospatial Consortium (OGC) OpenGIS specifications. PostGIS implements spatial indexing with an R-Tree-over-GiST (Refractions Research Inc., 2018). GiST, Generalized Search Tree (Hellerstein et al., 1995), is a height-balanced tree structure and allows arbitrary indexing schemes. The reasons for choosing it as the backend of our system are manifold:

• GiST indexes are “null safe”, whereas attempting to build a plain R-Tree index on data which contain an empty geometry field will fail.

• GiST uses a compression technique which results in fast indexing.

• The database facilitates the implementation of the resource caching mechanism.
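How such a backend can express index creation and a spatial join may be sketched by generating the SQL from Python. The table and column names (`source_geoms`, `target_geoms`, `uri`, `geom`) are hypothetical and do not reflect Geo-L's actual schema; `USING GIST` and the `ST_*` predicates are standard PostGIS features. Identifiers are interpolated directly, so the sketch is only suitable for trusted, fixed names.

```python
# PostGIS predicate functions for a few topological relations.
RELATION_FUNCTIONS = {
    "within": "ST_Within",
    "contains": "ST_Contains",
    "intersects": "ST_Intersects",
    "touches": "ST_Touches",
    "crosses": "ST_Crosses",
    "overlaps": "ST_Overlaps",
    "equals": "ST_Equals",
}


def index_sql(table):
    """SQL creating a GiST index on the (assumed) geometry column `geom`."""
    return f"CREATE INDEX {table}_geom_idx ON {table} USING GIST (geom);"


def link_sql(relation, source="source_geoms", target="target_geoms"):
    """SQL for a spatial join returning pairs that satisfy the relation."""
    func = RELATION_FUNCTIONS[relation]   # KeyError for unsupported relations
    return (
        f"SELECT s.uri AS source_uri, t.uri AS target_uri "
        f"FROM {source} AS s JOIN {target} AS t "
        f"ON {func}(s.geom, t.geom);"
    )


create_index = index_sql("source_geoms")
within_query = link_sql("within")
```

With a spatial index in place, the database can use MBB comparisons to restrict the join to candidate pairs before evaluating the exact predicate.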

The source code of Geo-L is available at https://github.com/DServSys/Geo-L

5 Experimental Settings

5.1 Datasets

The evaluation has been done by finding different relations between points and polygons, and between polygons and polygons, in the following datasets.

• SPOI - Smart Points of Interest: A data set which contains over 30 million Points of Interest important for tourism around the world (Cerba and Mildorf, 2016).

• OLU - Open Land-Use: Maps land use on local and regional level; contains over 11 million geometries (Polygons and MultiPolygons) (Mildorf et al., 2014).

• NUTS - Nomenclature of Territorial Units for Statistics: A standard for referencing European countries and their regions, for statistical processes (Eurostat - European Commission, 2015).

These datasets are stored under different graphs in the SPARQL endpoint (https://www.foodie-cloud.org/sparql) of the FOODIE project (http://www.foodie-project.eu/). While SPOI and OLU are excellent examples of big (open) linked data, NUTS is a standard schema. NUTS geometries are not represented in WKT form, and must be manipulated to conform to the form required by procedures for computing topological relations. Tools like LIMES, however, do not support such cases.

We compare the performance of LIMES and Geo-L with respect to both topological relations discovery and data retrieval time from endpoints.

5.2 Experiments

The performance of the Geo-L system is evaluated in terms of runtime by conducting experiments on simulated test sets as well as real-world scenarios. We also note differences in linking results if they occur. We compare the performance of our system with that of LIMES, which is implemented using parallel processing. The task is viewed as consisting of two stages: download and caching, and linking; we report the performance for each of them. The simulations enable evaluation of system performance under realistic conditions, with scenarios which otherwise might not be explored, and at the same time provide a reliable way to confirm their results. All experiments have been performed on a 64-bit Linux machine with an Intel Core i7-7800X CPU @ 3.50GHz and a total of 12 threads (6 CPU cores $\times$ 2 threads per core).

5.2.1 Simulation

Our simulations consist of finding topological relations where subsets of the OLU dataset are used as both source and target datasets. This setting has multiple advantages: First, it allows us to demonstrate the benefits of caching with regard to data set retrieval. Additionally, the structure of the OLU set, which consists of separate geometries with non-hierarchical relations, facilitates the link quality evaluation. We used this approach to perform a preliminary comparison of three implementations on a subset of 165,000 entities (as source and target sets) and observed that the implementations which used GeoPandas performed considerably slower than the one which employed PostgreSQL with PostGIS. For example, the mapping time required for calculating the within relation was 38 seconds for the implementation which used GeoPandas, about 20 minutes for the GeoPandas Cython implementation, and less than 4 seconds for the implementation which used PostgreSQL. Therefore, in the following experiments, the latter serves as our reference system.

We tested the systems with two subsets: one contains the first 165,000 geometries, and the other the first 400,000 geometries. Figure 2 compares the dataset retrieval times of OLU subsets for both LIMES and Geo-L. The first scenario shows that the retrieval time for LIMES is about twice as long as that of Geo-L. The reason is that LIMES does not detect whether data already exist, and downloads the same OLU subset twice, both as source and target datasets. The second scenario emphasizes this phenomenon: Whereas Geo-L retrieves only the data which have not been downloaded yet, and does it only once, LIMES retrieves the subset of 400,000 geometries twice, which takes more than six times longer.

Moreover, LIMES stores redundant data, e.g., the subset of the first 165,000 geometries is stored four times, as it is contained in the 400,000 geometries subset.

Experiments have been repeated ten times for each topological relation type per subset, and the average mapping times are shown for both LIMES and Geo-L in Figure 3 and Figure 4. As can be observed, Geo-L discovers topological links faster than LIMES for all relations in these experiments. The coefficients of variation (CV) of runtimes for the different experiments were found to be low in all cases (CV $<$ 0.1), which indicates that these results are consistent.
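The coefficient of variation is the standard deviation of the runtimes divided by their mean. A minimal sketch follows; the ten runtimes are hypothetical placeholder values, not measurements from the experiments, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean (assumes a positive mean)."""
    return stdev(samples) / mean(samples)


# Hypothetical runtimes in seconds for ten repetitions of one experiment.
runtimes = [3.9, 4.1, 4.0, 3.8, 4.2, 4.0, 3.9, 4.1, 4.0, 4.0]
cv = coefficient_of_variation(runtimes)
is_consistent = cv < 0.1
```

Because the CV is normalized by the mean, it allows runtimes of experiments with very different durations to be compared against the same threshold.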

In addition, we found discrepancies between the links discovered by each system. For example, when looking for links of entities which stand in the within relation in two sets with identical entities, the expected result is that each item in the source set would stand in this relation with exactly one entity of the target set, and that the size of the returned set would be equal to the size of each of the sets. However, for the $165\cdot 10^{3}$ OLU subset Geo-L found $164,935$ links, whereas LIMES found $155,083$ . The 65 entities which Geo-L did not include had invalid geometries, which were detected already during construction and omitted from the search space. We examined the result computed by LIMES and noticed that the difference of $9852$ consisted mostly of “false negative” errors, i.e., valid geometries which were omitted from the result set ( $9849$ links). Also, there were three links that Geo-L did not find and LIMES did. These, however, are “false positives”, i.e., the links contained invalid geometries, which were included in the result set by LIMES, whereas Geo-L had omitted them before computing the links. Similar errors occurred also for other topological relations.
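A comparison of this kind amounts to set differences between two result sets. The sketch below uses hypothetical toy links and illustrative names; only the two count checks at the end use figures stated in the text.

```python
def compare_link_sets(links_a, links_b):
    """Links found by only one of two systems, given sets of (source, target)."""
    return {"only_a": links_a - links_b, "only_b": links_b - links_a}


# Hypothetical toy result sets for a self-join, where each entity is
# expected to be linked to itself.
links_a = {("a", "a"), ("b", "b"), ("c", "c")}
links_b = {("a", "a"), ("x", "x")}
diff = compare_link_sets(links_a, links_b)

# Count checks with the figures reported above.
expected_links = 165_000 - 65        # subset size minus invalid geometries
count_gap = 164_935 - 155_083        # difference between the two result sizes
```

Each link appearing on only one side then has to be inspected individually to decide whether it is a missed valid link or a link involving an invalid geometry.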

5.2.2 Real-World Scenarios

We experiment with topological relation discovery between pairs of the geospatial resources mentioned in Section 5.1, and compare the performance of Geo-L to that of LIMES. Figure 5 shows the performance, in terms of mapping runtime, on different subsets of SPOI and OLU. In this example the largest subset does not contain the other two: the first $500\cdot 10^{3}$ entities of OLU contain geometries which caused the LIMES system to crash, and therefore we chose a subset of the same size but specified a different offset.

Figure 6 shows the running times for mapping SPOI to NUTS with different subset sizes of SPOI. Since NUTS geometries are not represented in WKT format we used a configuration feature which defines a resource via a SPARQL query. In this case, the query also transforms the geometries into the required format. This, however, is not possible in LIMES, and therefore comparison of the systems is not presented.

Figure 7 shows mapping runtime for different subsets of OLU to NUTS, for different topological relations.

The system has been employed as part of DataBio, an EU Horizon 2020 project. A main goal of the project is to show the benefits of Big Data technologies in raw material production from agriculture for the bioeconomy industry. The project uses Linked Data as a federated layer to integrate cross-organizational heterogeneous data.

In particular, Geo-L has been successfully applied to various use cases in field management, e.g.:

• identifying plots from the Czech registry of farmland which intersect with buffer zones around water bodies. A buffer zone is a vegetated or forested strip around lakes and along water courses. Its purpose, in the context of agricultural management, is to protect water bodies from pollutants such as pesticides, nutrients, and sediment (Zhang et al., 2010). Therefore, it is crucial to detect cases where field areas and buffer zones intersect. Figure 8 depicts a case where a buffer zone of a lake intersects with a field, marked in orange.

• identifying erosion zones for a specific farm. Soil erosion is the detachment and deposition of soil particles. It may be caused by, e.g., wind, snow, or water, but also by human-induced land use (Vanwalleghem, 2016). As the latter results in much faster erosion rates, it can affect soil quality dramatically due to loss of nutrients as well as of the ability to get them. It is therefore important to control erosion, since it impacts productivity and sustainability negatively (Larson et al., 1983; Blanco-Canqui and Lal, 2010). Figure 9 shows erosion zones overlapping with a plot, marked in dark blue.

• identifying fields within a particular region which grow the same crop type in a specific year as a given field in that region. This serves as an assisting tool for farm management and agricultural landscape planning, e.g., controlling crop diversification or rotation. Figure 10 presents all fields which grow the same crop type as the field marked in brown, here maize for silage, during 2019, within the South Moravian Region (region border marked in grey).

6 Conclusions

This paper presented Geo-L, a system for discovering RDF links between geospatial entities, based on topological relations. We conducted experiments to detect topological relations between points and polygons, and between polygons and polygons. The experiments show that Geo-L outperforms LIMES (Ngomo and Auer, 2011), a state-of-the-art link discovery system, for this task in several aspects:

• Scalability and efficiency: Geo-L's configuration allows a dataset to be formed directly from the SPARQL query that defines it. This feature is particularly useful when data at the SPARQL endpoint are stored in a different form than the linking task requires, but can be transformed into the required format through SPARQL functions.

  – Download time: Datasets are not cached for a single task only but are treated as resources in their own right. Thanks to its caching mechanism, Geo-L accesses the SPARQL endpoints only when data required in the dataset are missing, and expands existing datasets where possible.

  – Mapping time: Geo-L utilizes PostgreSQL with a PostGIS index for storing and indexing the data. This enables efficient spatial joins between source and target datasets.

• Robustness: Geo-L includes multiple features that strengthen the robustness of the application.

  – Caching: Geo-L caches portions of the data as they are downloaded, rather than writing the whole dataset once the download has finished. This property prevents data loss when, e.g., the connection to the remote endpoint is lost.

  – Mapping accuracy: Geo-L detects entities whose geometries are invalid according to the OGC OpenGIS specification and does not include them in the search space. In addition, in several cases LIMES did not include valid geometries in the result set, whereas Geo-L correctly did.

• Interoperability and flexibility: Geo-L can be used as a stand-alone application or as a REST service (in a Docker container), which allows it to be integrated with other applications. The slim, SPARQL-based set-up of the source and target configuration (as JSON) makes the tool flexible to use.
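The mapping-time point above rests on an indexed spatial join. The general idea behind such a join, a cheap bounding-box filter followed by an exact geometric test, can be illustrated with a minimal pure-Python sketch. This is not Geo-L's implementation (which delegates the work to PostGIS): the function names, the toy identifiers, and the data are hypothetical, polygons are assumed to be simple rings without holes, and the linear scan over bounding boxes merely stands in for a real spatial index.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # exterior ring only, no holes (simplifying assumption)


def bbox(poly: Polygon) -> Tuple[float, float, float, float]:
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a ring."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(p: Point, poly: Polygon) -> bool:
    """Ray-casting test; points exactly on the boundary get no special treatment."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def within_join(points: Dict[str, Point],
                polygons: Dict[str, Polygon]) -> List[Tuple[str, str]]:
    """Return (source, target) pairs where the source point lies in the target polygon."""
    boxes = {tid: bbox(poly) for tid, poly in polygons.items()}
    links = []
    for sid, (x, y) in points.items():
        for tid, (min_x, min_y, max_x, max_y) in boxes.items():
            # Filter step: discard candidates whose bounding box misses the point.
            if not (min_x <= x <= max_x and min_y <= y <= max_y):
                continue
            # Refinement step: exact test on the surviving candidates only.
            if point_in_polygon((x, y), polygons[tid]):
                links.append((sid, tid))
    return links


# Hypothetical toy data: three points of interest and two rectangular regions.
source = {"poi:1": (1.0, 1.0), "poi:2": (5.0, 5.0), "poi:3": (3.5, 0.5)}
target = {
    "region:A": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    "region:B": [(3.0, 0.0), (6.0, 0.0), (6.0, 1.0), (3.0, 1.0)],
}
links = within_join(source, target)
```

In a database-backed setting the filter step is answered by the index rather than by a scan, so the exact test runs only on a small candidate set.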

Future work will examine relations between other types of geometries and explore geospatial relations based on various distance measures. The current implementation returns the same items for each dataset once they are cached. In the future we will also address re-caching in case data at the SPARQL endpoint have been modified, an issue which is, to the best of our knowledge, not handled by other geospatial-linking systems.
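The caching behaviour summarized in the conclusions (portions are stored as they arrive, the endpoint is contacted only for missing data, and existing datasets are expanded) can be sketched as follows. This is a minimal illustration under stated assumptions, not Geo-L's code: it uses an in-memory SQLite table instead of PostgreSQL, `ensure_cached` and `flaky_endpoint` are hypothetical names, the endpoint is simulated by a Python list, and resuming by offset assumes that the remote query returns rows in a stable order. Like the current implementation described above, the sketch never re-checks whether cached rows have changed remotely.

```python
import sqlite3
from typing import Callable, List, Tuple

Row = Tuple[str, str]  # (resource URI, WKT geometry)


def ensure_cached(db: sqlite3.Connection, dataset: str, wanted: int,
                  fetch_page: Callable[[int, int], List[Row]],
                  page_size: int = 2) -> int:
    """Make sure `wanted` rows of `dataset` are cached; return rows fetched remotely."""
    db.execute("CREATE TABLE IF NOT EXISTS cache "
               "(dataset TEXT, pos INTEGER, uri TEXT, wkt TEXT, "
               "PRIMARY KEY (dataset, pos))")
    have = db.execute("SELECT COUNT(*) FROM cache WHERE dataset = ?",
                      (dataset,)).fetchone()[0]
    fetched = 0
    while have < wanted:
        rows = fetch_page(have, min(page_size, wanted - have))
        if not rows:
            break  # endpoint has no more data
        db.executemany("INSERT INTO cache VALUES (?, ?, ?, ?)",
                       [(dataset, have + i, uri, wkt)
                        for i, (uri, wkt) in enumerate(rows)])
        db.commit()  # persist this page before requesting the next one
        have += len(rows)
        fetched += len(rows)
    return fetched


# Simulated remote endpoint (hypothetical data) that fails on its second request.
REMOTE = [(f"ex:poi{i}", f"POINT({i} {i})") for i in range(5)]
calls = {"n": 0}


def flaky_endpoint(offset: int, limit: int) -> List[Row]:
    calls["n"] += 1
    if calls["n"] == 2:
        raise ConnectionError("simulated connection loss")
    return REMOTE[offset:offset + limit]


db = sqlite3.connect(":memory:")
try:
    ensure_cached(db, "spoi", 4, flaky_endpoint)
except ConnectionError:
    pass  # the page stored before the failure is still in the cache

kept = db.execute("SELECT COUNT(*) FROM cache").fetchone()[0]
resumed = ensure_cached(db, "spoi", 4, flaky_endpoint)   # fetches only the missing rows
expanded = ensure_cached(db, "spoi", 5, flaky_endpoint)  # grows the existing dataset
again = ensure_cached(db, "spoi", 5, flaky_endpoint)     # fully cached, no remote call
```

Committing after every page is what keeps partial downloads usable after a failure; the cost is one transaction per page, which a real system might tune via the page size.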

Bibliography

Achichi M, Cheatham M, Dragisic Z, Euzenat J, Faria D, Ferrara A, Flouris G, Fundulaki I, Harrow I, Ivanova V, et al. (2017) Results of the ontology alignment evaluation initiative 2017. In: OM 2017-12th ISWC workshop on ontology matching, CEUR-WS, pp 61–113
Auer S, Lehmann J, Hellmann S (2009) LinkedGeoData: Adding a spatial dimension to the web of data. In: International Semantic Web Conference, Springer, pp 731–746
Battle R, Kolas D (2011) GeoSPARQL: enabling a geospatial semantic web. Semantic Web Journal 3(4):355–370
Bechhofer S, Van Harmelen F, Hendler J, Horrocks I, McGuinness DL, Patel-Schneider PF, Stein LA, et al. (2004) OWL web ontology language reference. W3C recommendation
Beckmann N, Kriegel HP, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles. In: Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD’90), ACM, pp 322–331
Behnel S, Bradshaw R, Citro C, Dalcin L, Seljebotn DS, Smith K (2011) Cython: The Best of Both Worlds. Computing in Science & Engineering 13(2):31–39
Bizer C, Heath T, Berners-Lee T (2011) Linked data: The story so far. In: Semantic services, interoperability and web applications: emerging concepts, IGI Global, pp 205–227
Blanco-Canqui H, Lal R (2010) Erosion control and soil quality. In: Principles of Soil Conservation and Management, Springer, pp 477–492

Table 1: Geometry features of the components of the within formula and the dimension of their intersections

$f_{1}(a), f_{2}(b)$	$\dim(f_{1}(a) \cap f_{2}(b))$
$I(a), I(b)$	2
$I(a), E(b)$	-1
$B(a), E(b)$	-1
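The three rows above are the cells of the DE-9IM intersection matrix that the within relation constrains: the interiors must intersect, while the interior and the boundary of $a$ must not meet the exterior of $b$ (dimension $-1$ denotes an empty intersection). In the standard DE-9IM string notation this is the pattern `T*F**F***`. The following minimal sketch checks a matrix string against such a pattern; the helper name `relate_matches` and the example matrices are illustrative assumptions and are not taken from Geo-L.

```python
# Cell order of a DE-9IM string: II, IB, IE, BI, BB, BE, EI, EB, EE
# (I = interior, B = boundary, E = exterior; first letter refers to a, second to b).
WITHIN = "T*F**F***"


def relate_matches(matrix: str, pattern: str) -> bool:
    """Check a DE-9IM matrix (cells 0, 1, 2 or F) against a pattern (T, F, *, 0, 1, 2)."""
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("DE-9IM strings must have exactly 9 cells")
    for cell, want in zip(matrix, pattern):
        if want == "*":
            continue  # any value is accepted
        if want == "T":
            if cell not in "012":  # T requires a non-empty intersection
                return False
        elif cell != want:  # F or an exact dimension must match literally
            return False
    return True


# Polygon a strictly inside polygon b: interiors intersect with dimension 2,
# and neither I(a) nor B(a) meets E(b).
a_within_b = "2FF1FF212"
# The transposed situation (a contains b) and two disjoint polygons.
a_contains_b = "212FF1FF2"
a_disjoint_b = "FF2FF1212"

results = [relate_matches(m, WITHIN) for m in (a_within_b, a_contains_b, a_disjoint_b)]
```

For two polygons the first cell of a matching matrix is 2, in line with the first row of the table; the pattern itself only requires that cell to be non-empty.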