Review Comment:
This second version of the article clarifies the scope of this survey, i.e. listing and comparing point set distance measures for link discovery between geospatial resources described by vector geometries interpreted as ordered sets of points. These measures are compared against five criteria: time-efficiency, quality of data linking results, robustness to granularity discrepancies, robustness to measurement discrepancies and robustness to both types of discrepancies. Their effectiveness for geospatial data linking is eventually assessed against a benchmark geographic dataset, generated from existing datasets.
Linking geospatial resources is an important issue for the Semantic Web community, and the article is rather clearly presented. Unfortunately, the assumptions behind the approach used to evaluate the surveyed distance measures depend too much on the presented use case. This tends to restrict the scope of the survey to link discovery between geospatial datasets with the same levels of detail, the same spatial density and the same geographical extent as the datasets used for evaluating the distances.
As a matter of fact, the choice of the basic distance used for all evaluated measures is not fully justified. The evaluated measures are computed with both the great-circle distance and the great elliptic distance, and the results are compared in terms of precision, recall, F-measure and run time. The great-circle distance is eventually chosen because it yields equivalent linking results at lower computing times. However, it is regrettable that the same comparative test has not been applied to the Euclidean distance computed on projected geometries. This distance is not even considered by the authors, on the grounds that projections distort distances more drastically when applied to large areas and that the datasets chosen for evaluating the surveyed distance measures have a large geographic extent. However, geospatial data linking tasks may also deal with datasets of rather small geographic extent. In such cases a simple Euclidean distance on projected data might give the same results in significantly less time than the orthodromic distance. For pan-European use cases like the one presented here, the INSPIRE Directive [1] recommends the projected coordinate reference system Lambert Conformal Conic (ETRS89-LCC). Conformal projections have a scale factor that varies in a known manner, so distance measures are affected in a predictable way and are locally affected in the same way. Providing a quantitative evaluation of the three distances on this use case and on datasets with a smaller geographic extent, and discussing which distance should preferably be used in each case, would therefore have strengthened the authors' choices and helped new practitioners make the best choice for their own use case.
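To make this point concrete, the missing comparison could be run along the following lines. This is only a minimal sketch, assuming Python with pyproj, taking EPSG:3034 as the code for ETRS89-LCC, and using made-up coordinates; it is not the article's implementation.

```python
# Sketch: orthodromic (great-circle) distance vs Euclidean distance on
# coordinates projected to ETRS89-LCC (assumed here to be EPSG:3034).
# Requires pyproj. All coordinates below are purely illustrative.
import math
from pyproj import Transformer

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; the article may use another value

def great_circle_m(lon1, lat1, lon2, lat2):
    """Haversine formula for the great-circle distance, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
           * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# WGS84 lon/lat -> ETRS89-LCC planar coordinates (metres).
to_lcc = Transformer.from_crs("EPSG:4326", "EPSG:3034", always_xy=True)

def euclidean_projected_m(lon1, lat1, lon2, lat2):
    """Plain Euclidean distance on LCC-projected coordinates, in metres."""
    x1, y1 = to_lcc.transform(lon1, lat1)
    x2, y2 = to_lcc.transform(lon2, lat2)
    return math.hypot(x2 - x1, y2 - y1)

# Two points a few kilometres apart (hypothetical).
p, q = (2.35, 48.85), (2.45, 48.90)
print(great_circle_m(*p, *q), euclidean_projected_m(*p, *q))
```

On a small extent, the gap between the two values reflects the locally constant scale factor of the conformal projection and is therefore predictable, while the projected variant avoids trigonometric evaluations in the inner loop of a linking run; this is exactly the trade-off a quantitative test should have documented.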
Two modifiers are proposed and applied to the NUTS dataset: the geometries are altered to generate new datasets with various positional accuracies and lower spatial granularity. Reducing the level of detail of a dataset is a common task in cartography called generalisation, which aims at making maps legible at different scales. Contrary to what is done here, cartographic generalisation algorithms are designed to preserve the overall shape and topological consistency of the input geometries (see [2] for an example on a partition of a territory). Data linking tasks based on geometries with different levels of detail commonly deal with geometries that have similar shapes and are topologically consistent. There may be data linking tasks on geometries whose levels of detail are so different that their shapes are totally different, such as NUTS and DBpedia, but such use cases are not representative of the majority of data linking tasks dealing with discrepancies in granularity. That is why I am not convinced by the approach proposed to generate the benchmark datasets. Moreover, I am not persuaded that the conclusions drawn here on the effectiveness of the surveyed distances would still hold on other datasets, even ones whose geometries can be viewed as sets of points and that exhibit discrepancies in precision and granularity, such as datasets representing buildings at different levels of detail in a dense urban area, or heterogeneous land use classifications represented by polygons and requiring 1:n or n:m link discovery.
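To illustrate the contrast, a shape-preserving generalisation step can be compared with naive vertex removal. The sketch below uses shapely's Douglas-Peucker simplification on a made-up polygon; it is not the article's modifier, only an illustration of the difference.

```python
# Sketch: naive vertex removal vs Douglas-Peucker simplification, which
# preserves the overall shape within a tolerance. Requires shapely.
from shapely.geometry import Polygon

# Hypothetical boundary with a pronounced indentation.
poly = Polygon([(0, 0), (4, 0), (4, 4), (2.1, 4), (2, 1), (1.9, 4), (0, 4)])

# Naive granularity reduction: keep every other vertex (shape not preserved).
coords = list(poly.exterior.coords)[:-1]
naive = Polygon(coords[::2])

# Cartographic-style generalisation: Douglas-Peucker, topology preserved.
generalised = poly.simplify(0.5, preserve_topology=True)

print(naive.is_valid, generalised.is_valid)
print(poly.area, naive.area, generalised.area)
```

With these coordinates the naive variant even becomes self-intersecting, whereas the tolerance-bounded simplification keeps the indentation and stays valid; benchmark geometries degraded in the naive way do not resemble what generalisation actually produces.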
Less importantly, I checked again the value of the Fréchet distance computed with GeOxygene, based on the same orthodromic point-to-point distance as the one presented in the article, and I still get nearly 34 km.
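For the record, this check can be reproduced independently of GeOxygene with the standard discrete Fréchet recursion of Eiter and Mannila over the same orthodromic point-to-point distance. The sketch below assumes (lon, lat) point lists extracted from the two geometries under comparison; the coordinates shown are purely illustrative.

```python
# Sketch: discrete Fréchet distance (Eiter & Mannila dynamic programming)
# over an orthodromic point-to-point distance, for cross-checking libraries.
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; adjust to match the article

def orthodromic_m(p, q):
    """Great-circle distance between two (lon, lat) points, in metres."""
    (lon1, lat1), (lon2, lat2) = p, q
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
           * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two polylines (ordered point sets)."""
    n, m = len(P), len(Q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = orthodromic_m(P[i], Q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            else:
                prev = min(
                    ca[i - 1][j] if i > 0 else math.inf,
                    ca[i][j - 1] if j > 0 else math.inf,
                    ca[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
                )
                ca[i][j] = max(prev, d)
    return ca[n - 1][m - 1]

# P and Q would be the ordered point sets of the two geometries in question.
P = [(6.0, 49.5), (6.2, 49.6), (6.4, 49.8)]
Q = [(6.0, 49.5), (6.3, 49.7), (6.4, 49.8)]
print(discrete_frechet(P, Q) / 1000.0, "km")
```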
Finally, a survey of the available distance measures for spatial data linking would be of great interest. As a matter of fact, there is an extensive literature on spatial data matching, scattered among several scientific communities, and few surveys to help new practitioners get started on the topic. Sadly, the survey proposed in this article seems too driven by the considered use case and lacks hindsight. Considering different use cases and taking into account some references from the field of geographical data matching could help improve this work and make it a reference article for researchers willing to use the spatial reference as a data linking criterion.
[1] http://inspire.ec.europa.eu/documents/Data_Specifications/INSPIRE_DataSp...
[2] P. van Oosterom. The GAP-tree, an approach to 'on-the-fly' map generalization of an area partitioning. In GIS and Generalization: Methodology and Practice, pp. 120-132.