OnGIS: Semantic query broker for heterogeneous geospatial data sources

Tracking #: 1358-2570

Authors: 
Marek Smid
Zdenek Kouba

Responsible editor: 
Boyan Brodaric

Submission type: 
Full Paper
Abstract: 
Querying geospatial data from multiple heterogeneous sources backed by different management technologies poses an interesting problem both in the data integration and the subsequent result interpretation. This paper proposes broker techniques for answering a user's complex spatial query: find relevant data sources (from a catalogue of many) capable of answering the query, eventually split the query and find relevant sources for the query parts, when no single source suffices. For the purpose, we describe each source with a set of prototypical queries that are algorithmically arranged into a lattice, which makes searching efficient. The proposed algorithms leverage GeoSPARQL query containment enhanced with OWL 2 QL semantics. A prototype is implemented in a system called OnGIS.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 18/May/2016
Suggestion:
Reject
Review Comment:

This work studies the problem of querying heterogeneous geospatial data sources using the geoSPARQL language. Firstly, the authors extend the GeoSPARQL vocabulary by defining sub-properties relations. Then, exploiting the proposed extension, a method that identifies containment between queries is described. Finally, a lattice structure is used to arrange several queries so that contained relations can be easily captured.

The problem considered by this work is interesting and useful. However, I think that the presented work is not ready for publication. In brief, the presentation and language of the paper should be considerably improved. Additionally, I have a concern regarding the contribution and the novelty of this work, since it is simple adopts already developed techniques and tools. Another crucial issue is that a large number of related works are missing. The authors should significantly extend the related work section and compare their approach with the existing works. Finally, an experimental evaluation should be conducted.

In more detail:

The paper does not define or even clearly describe the working scenario and/or the studied problem. A (formal) problem/scenario description should be provided in the introduction section. Additionally, it will be very beneficial to use a running example (as the one presented in Listing 1) throughout the paper.

The paper does not consider a large number of related works. Particularly, the authors should at least mention and compare with the works in the areas of RDF query federation, distributed RDF querying and SPARQL rewriting. For example, see [1-5] and their references. For these works, the authors should describe what are the differences in their setting and why the existing approaches cannot be applied.

Regarding the structure of the paper, Section 4.2 should be form a subsection of Section 3. In addition, I think that a large number of the details provided in Sections 3.1 & 3.2 should be removed. Finally, in Section 4.3 the definition of TR(OP) should precede the definition of Rel(i,q_2).

In Section 4 the authors provide a very abstract description for the parts of SPARQL that they support. A (formal) definition (e.g., a grammar) regarding the supported SPARQL parts is required.

There are several issues with Section 5. Particularly, the definition/construction of the set of queries that are used to generate the lattice is not described. In addition, several issues regarding query splitting are mentioned in a very confused/abstract manner.

An experimental evaluation should be conducted, considering several issues: e.g., the time required for building and traversing the lattice, query splitting vs number sources, etc. Finally, an experimental comparison with existing approaches is required.

Several comments:

- Abstract, Section 1, 4.2, etc: you should define “prototypical queries”
- Abstract: “which makes searching efficient” -> This is not shown somewhere; how do you know that search is efficient?
- Section 1, par 5: query containment is not a “method”
- Section 1: You should remove the details/descriptions for most of the acronyms; in current form it seems very confusing. For example: “OGC – Open Geospatial Consortium^1, a large consortium, is setting many standards, e.g. WFS – Web Feature Service^2, a standard for publishing vector spatial data, WCS – Web Coverage Service^3, a standard for publishing source raster data, ESRI ArcGIS services, etc.)”
- Section 3: “it adds a few features to RDFS, including limited negation, which suffices to our purposes” - > not “a few” – but a large number of features.
- Section 4: “Graph patterns and filters are supported, however optional patterns, ordering, grouping, offsets and limits are not.” -> “Graph patterns” is not correct here, since graph pattern includes optional. You should replace “graph patterns” with “basic graph patterns”.
- Section 4.3: you should describe the intuition for the effective topological relation.
- Section 5: Enrich the section’s title “Lattice” → e.g., building and traversing lattice structure
- Section 5.2, 1 par: Section 5→ Section 5.1

[1] Olaf Gorlit et al. Federated Data Management and Query Optimization for Linked Open Data New Directions in Web Data Management 1, volume 331 of Studies in Computational Intelligence. Springer 2011
[2] Andreas Schwarte et al. FedX: Optimization Techniques for Federated Query Processing on Linked Data ISWC’11
[3] Audun Stolpe A logical characterisation of distributed SPARQL processing SWJ 2015
[4] Hartig et al. Executing SPARQL Queries over the Web of Linked Data. ISWC ’09
[5] B. Quilitz and U. Leser. Querying Distributed RDF Data Sources with SPARQL ESWC’08

Review #2
Anonymous submitted on 25/May/2016
Suggestion:
Major Revision
Review Comment:

OnGIS: Semantic query broker for heterogeneous geospatial data sources

Overview:

This paper described an algorithm to implement query broker for
geospatial data represented in the form of ontology knowledge base.
Queries supported by data sources are organized in a lattice. The partial
order of the queries in the lattice is based by query containment relation.
The query containment relation is decided by reduction to the problem of
DLR Abox satisfiability.
The paper considered decision procedure for query containment with
extensions to filters and topological relation.
The paper further presented pseudocode for building query lattice, searching
query lattice, and answering user queries with the query lattice
(with heuristics-based query splitting).

General comment:

The topic of paper is interesting and significant.
However, the paper is not well organized and some parts are difficult to understand.
In particular, the sections feel disjoint.
The paper lacks an overview of semantic query broker, which (I think) is based on a
lattice of queries supported by data sources, while the lattice is based on
query containment relation.
After a general introduction and related work, the paper jumps to GeoSPARQL, OWL QL,
and query containment but never stated why they are there.
I had to go through many pages until Section 5 before I realized why Section 2-4 are included.
Also, the paper is too detailed in some parts while not detailed enough in others (I will
mention a few places next.)

Specific comment:

1. How about some description of the implementation (OnGIS) or experiment
based on the proposed idea?

2. Query containmenet was repeatedly mentioned from page 1 to page 3 but it was not defined (even informally) until Section 4.

3. In Section 3, before you talk about GeoSPARQL and OWL QL, how about
explaining the relation between query broker and query containment?

Some background may be necessary here.
Do you define data sources as materialized views expressed as queries on logical views?

4. Some details of GeoSPARQL such as the operators of simple feature,
egenhofer, and RCC8 in Section 3.1 are really not necessary. Some citations would be
sufficient.

5. Is Table 1 really necessary?

6. Section 4, paragraph 1. This is the first place you defined query containment. It should be
stated much earlier.

7. Section 4.2 is the most confusing part of the paper.
Authors mentioned that the decision procedure for query containment was derived from that of
another paper but with some simplification and modification.
However, the simplification part and how it impacts the decision procedure are not explained.
The paper uses informal language to explain a quite complex algorithm but without giving
enough details for users to understand.
The authors may want to skip all the explanation and simply point to the original paper for
the algorithm and explain the simplification and extension that must be made to make it work.
As an example of the confusing part, the definition of "complete" A-box is for what purpose?
I see that you tried to explain it in page 7 (1st paragraph of left column but that paragraph
is really cryptic).

The definition of testing ontology is also confusing.
When you write \bar(O)(C(i_x)) = (O \cup A_2) \ { C(i_x) } \cup A_1 \cup { NC(i_x) }
it seems that you are defining \bar(O) for inputs of C(i_x) but where does this C(i_x) come from?
Does it belong to A_1 or A_2 or it doesn't matter.
Also, why set \bar(O)(C(i_x)) to the entire set of O or A_1?

8. Page 7, paragraph above Section 4.3, you mentioned that you assumed the variables are
consistently named between the queries and renaming wasn't considered. You should have
stated it first. I have been wondering how you are matching the variables in q1 and q2.

9. Example in Listing 1 is confusing.
What is q1 and what is q2? Are they sharing all the triples but the last two?
How about just write two queries separately.

10 Section 5. Why do you talk about a lattice all of sudden? I can guess why but some explanation
of its purpose in query broker would be useful.

11. Section 5.1. You mentioned top/bottom of the lattice where the top is all answers and bottom is no answers. Then you used the term root node of the lattice, which I guess is the bottom. Later on, you said that you don't even have a top. This is very confusing. Wny not just say that you start at the bottom of the lattice and work your way up?

12. The second sentence of the last paragraph of Section 5.1 is very confusing.

13. Something may be wrong with the 3rd sentence of the first paragraph in Section 5.2.

14. The example in Section 6 is too simple to show the utility of the algorithm.

Review #3
Anonymous submitted on 09/Jun/2016
Suggestion:
Major Revision
Review Comment:

SUMMARY:

The paper addressed the problem of flexibly querying heterogenous geospatial (GIS) data sources. For that, the authors describe the system OnGIS, a semantic query broker, which is based on a framework for querying answering (QA) over multiple GIS data sources. It is based on a set of prototypical queries, which are matched to the sources at query time. As a query language, they use a restricted version of GeoSPARQL without optionals (which would introduce non-monotonicity), but introduce as the ontology language DL-Lite instead of RDF(S) (which is the standard). For query decomposition and QA, they extend existing query containment techniques, i.e. the work of Horrocks [1], to (spatial) topological relations, which include RCC8, Egenhofer (DE-9IM), and simple relations. They provide a new method of calculating spatial containment based on "topological relation restriction", which establishes a hierarchy of spatial containment (and equality). Their "simple" algorithm for query evaluation uses query containments to build a lattice of related subqueries, which splits the query at the first level of the lattice. For optimizations, they propose a pruning heuristics to eliminate redundant query subsets. Finally, the provide an example scenario using LinkedGeoData, GeoNames, and a custom KB to illustrate how their approach would be applied in practice.

The main contributions of the paper are:
- The authors give a good overview of GeoSPARQL and the related data model.
- They develop an geo-ontology used with GeoSPARQL, which puts the different topological relation (RCC8, Egenhofer, etc.) into a common framework.
- They give detailed insight into standard query containment technique and suggest that it can be applied to DL-Lite.
- The technique for query containment is extended with spatial containment using topological relation restrictions. The give a good description of their technique, which is novel and interesting.
- They use their technique for an algorithm to calculate a lattice of sub-query containments based on an input query.
- The lattice then is used to evaluate the queries based on the different data sources. Further they refine the algorithm using pruning to eliminate redundant queries.
- Finally the show by an example, how their approach would work in a "real-world" scenario.

ORIGINALITY:

The originality is given, since the authors aim to lift existing algorithms for query containment to the spatial setting. Using a lattice of containments for query evaluation is not entirely new, but an interesting approach for evaluation.

SIGNIFICANCE:

This work is surely significant, if some details are clarified, since there are still not that many approaches and frameworks which aim to combine DL-Lite, GeoSPARQL, and spatial relations with heterogenous data sources.

QUALITY OF WRITING:

The paper is well structured and straightforward to read. However the authors write mainly in the 3rd person, which makes it sometimes hard to read. Some minor issues are below.

TECHNICAL SOUNDNESS:

Correctness of the provided algorithms is only sketched, and we doubt that it is well thought-out. In particular, the authors do not address the specific issues which are important if one deals with DL-Lite. They do not mention anything regarding query rewriting using PerfectRef or similar algorithms, nor do they address how to deal with existentials (unknown individuals) on the right-hand side of inclusion assertions. Things are getting even more complicated if SPARQL and spatial relations are used for QA over DL-Lite.

MAJOR ISSUES:

The topic of the paper and the main ideas are appealing. We believe that this would make a good conference paper but is not yet mature enough to be published in SWJ. For acceptance following major need should to be fixed (sorted by severity):

(1) Details in technical soundness.

(2) Since their approach is more on the practical side, we would expect at least a thorough experimental evaluation of the techniques and algorithms. Also the authors should provide mored details on the implementation itself. We believe, scalability is an important challenge, since they deal with huge datasets, e.g., LinkedGeoData. We doubt that the proposed algorithms scale, since they the entire KB has to be grounded to determine the containment of two sub-queries.

(3) Related work is not thoroughly covered, beside the above mentioned papers, recent work regarding QA with DL-Lite with RCC8 (see Möller [2]) and DL-Lite with spatial relations (see Eiter [3]) is not discussed. Besides Parlament, there are other GeoSPARQL engines, which need to be discussed, i.e., Strabon [4].

(4) We believe that the title is somewhat misleading, since the authors do not focus on providing a broker framework for a geospatial, but mainly write about query containment and evaluation techniques. If the focus is one the broker framework, we would appreciate more details on the broker architecture, the matching of the query templates to the actual query, and the techniques to deal with distributed data sources (e.g., traditional GIS web services).

(5) Query containment is an important technique for query optimization initially developed for relational databases (see Chandra [5]) and extended to DL-Lite (see Bienvenu [6]). However, there are other techniques available, i.e., tree-decomposition (see Maier [7]) which are often used to split the input query in sub-queries and building an evaluation strategy (using heuristics). It might be interesting to compare the different techniques side-by-side.

(6) In the introduction, a section on the general setting and the motivation of this work is missing.

MINOR ISSUES:

- Some sentences sound "unconventional", for example:
- "we use technique requiring ..." -> "Our technique requires"
- "The most important is the data itself, given in a form of serialization" -> "Serialized data is important, ..."
- "One part of supporting semantics of GeoSPARQL" -> "The semantics of GeoSPARQL is lifted to ..."
- "if it is semantically equivalent to a query", please clarify;
- They introduce the definition of DL-Lite, but miss the definition for the value-domains (see DL-LiteA);
- Please explain, why filters do not posses OWA?
- How does TR(OP) relate to the encoding matrix of the DE-9IM?
- The title "Lattice" seems a bit short;
- Fig. 5 needs more explanations;
- Listing 2 - 4 could go into the Appendix.

REFERENCES:

[1] Horrocks,I., Sattler,U., Tessaris,S., Tobies,S.: How to decide query containment under constraints using a description logic. Logic for Programming and Automated Reasoning, pp. 326–343 (2000)
[2] Özçep,Ö.L., Möller,R.: Scalable geo-thematic query answering. In: Proc. of ISWC 2012.LNCS, pp. 658–673 (2012)
[3] Eiter,T., Krennwallner,T., Schneider,P.: Lightweight spatial conjunctive query answering using keywords. In: Proc. of ESWC 2013. pp. 243–258 (2013)
[4] Kyzirakos,K., Karpathiotakis,M., Koubarakis,M.: Strabon: A semantic geospatial DBMS. In: Proc. of ISWC 2012. pp. 295–311 (2012)
[5] Chandra,K., Merlin,P.M.: Optimal implementation of conjunctive queries in relational data bases. In: Proc. of STOC 1977. (1977)
[6] Bienvenu,M., Lutz,C. Wolter,F.,: Query Containment in Description Logics Reconsidered. In: Proc. of KR 2012 (2012)
[7] Maier,D.: The Theory of Relational Databases. Computer Science Press (1983)