A Systematic Survey of Point Set Distance Measures for Link Discovery

Tracking #: 1281-2493

Authors: 
Mohamed Sherif
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Claudia d'Amato

Submission type: 
Survey Article
Abstract: 
Large amounts of geo-spatial information have been made available with the growth of the Web of Data. While discovering links between resources on the Web of Data has been shown to be a demanding task, discovering links between geo-spatial resources proves to be even more challenging. This is partly due to the resources being described by the means of vector geometry. Especially, discrepancies in granularity and error measurements across data sets render the selection of appropriate distance measures for geo-spatial resources difficult. In this paper, we survey existing literature for point-set measures that can be used to measure the similarity of vector geometries. We then present and evaluate the ten measures that we derived from literature. We evaluate these measures with respect to their time-efficiency and their robustness against discrepancies in measurement and in granularity. To this end, we use samples of real data sets of different granularity as input for our evaluation framework. The results obtained on three different data sets suggest that most distance approaches can be led to scale. Moreover, while some distance measures are significantly slower than other measures, distance measure based on means, surjections and sums of minimal distances are robust against the different types of discrepancies.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 13/Feb/2016
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

I thank the authors for having addressed the issues raised. The paper is definitely suitable as introductory text, comprehensive, well-written and important to the Semantic Web community.

Review #2
By Michael Guthe submitted on 07/Mar/2016
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

(1) The text is a very good introduction into the field of distance measures.
(2) The paper is easy to understand and the area is well covered.
(3) The presentation is clear.
(4) For me it is hard to judge the importance of distance measures in the Semantic Web community. The paper is however a very good comparison between different measures, so if these are important, the paper is important as well.

Review #3
By Nathalie Abadie submitted on 05/Apr/2016
Suggestion:
Major Revision
Review Comment:

This second version of the article clarifies the scope of this survey, i.e. listing and comparing point set distance measures for link discovery between geospatial resources described by vector geometries interpreted as ordered sets of points. These measures are compared against five criteria: time-efficiency, quality of data linking results, robustness to granularity discrepancies, robustness to measurement discrepancies and robustness to both types of discrepancies. Their effectiveness for geospatial data linking is eventually assessed against a benchmark geographic dataset, generated from existing datasets.
Geospatial resources linking is an important issue for the Semantic Web community and the article is rather clearly presented. Unfortunately, the assumptions behind the approach used to evaluate the surveyed distances measures depend too much on the presented use case. This tends to restrict the scope of this survey to link discovery between geospatial datasets with the same levels of detail, the same spatial density and the same geographical extent as the datasets used for evaluating the distances.
As a matter of fact, the choice of the basic distance used for all evaluated measures is not fully justified. The evaluated measures are computed based on both great-circle distance measure and great elliptic distance measure and the results are compared in terms of precision, recall, F-measure and run time. The great circle distance is eventually chosen based on equivalent linking results and lower computing times. However, it is regrettable that the same comparative test has not been applied to the Euclidian distance computed on projected geometries. This distance is not even considered by the authors because projections distort distances more drastically when they are applied on large areas and the datasets chosen for evaluating the surveyed distance measures have a large geographic extent. However, geospatial data linking tasks may also deal with datasets with a rather small geographic extent. In such cases a simple Euclidian distance on projected data might give the correct results in significantly shorter time than the orthodromic distance. For pan-European uses cases like the one presented here, the INSPIRE Directive [1] recommends the projected coordinates reference system Lambert Conformal Conic (ETRS89 - LCC). Conformal projections have a scale factor that varies in a known manner so that distances measures are affected in a predictable way and are locally affected in the same way. That is why providing a quantitative evaluation for the three distances on that use case and on datasets with a smaller geographic extent and discussing which distance should be preferably used in each case would have been interesting to strengthen the authors' choices and to help new practitioners make the best choice for their own use case.
Two modifiers are proposed and applied on the NUTS dataset: the geometries are altered to generate new datasets with various positional accuracy and lower spatial granularity. Reducing the level of detail of a dataset is a common task in cartography named generalisation, which aims at making maps legible at different scales. Contrary to what is done here, cartographic generalisation algorithms are designed to preserve the overall shape and topological consistency of the input geometries (see [2] for an example on a partition of some territory). Data linking tasks based on geometries with different level of detail commonly deal with geometries that have similar shapes and are topologically consistents. There may be data linking tasks on geometries with levels of detail so different that their shapes are totally different like NUTS and DBPedia. However such use cases are not representative of the majority of data linking tasks dealing with discrepancies in granularity. That is why I am not convinced by the approach proposed to generate benchmark datasets. Moreover, I am not persuaded that the conclusions drawn here on the effectiveness of the surveyed distances would be still valid on other datasets even with geometries that can be viewed as sets of points and having discrepancies in precision and granularity such as datasets representing buildings at different levels of detail in some dense urban area or on heterogeneous land use classifications represented by polygons and requiring 1:n or n:m links discovery.
Less importantly, I checked again the value of the Féchet distance computed with Geoxygene, based on the same orthodromic point-to-point distance as the one presented in the article, and I still get nearly 34 km.
Finally, providing such a survey about the availables distances measures for spatial data linking would be of great interest. As a matter of fact, there is an extensive literature about spatial data matching, scattered among several scientific communities and few surveys to help new practioners get started on that topic. Sadly, the survey proposed in this article seems too driven by the considered use case and lacks of hindsight. Considering different use cases and taking into account some references from the field of geographical data matching could help improve this work and make it a reference article for researchers willing to use the spatial reference as a data linking criterion.

[1] http://inspire.ec.europa.eu/documents/Data_Specifications/INSPIRE_DataSp...
[2] P Van Oosterom. The GAP-tree, an approach to ‘on-the-fly’map generalization of an area partitioning. GIS and Generalization, Methodology and Practice, 120-132