Building Spatio-Temporal Knowledge Graphs from Vectorized Topographic Historical Maps

Tracking #: 2742-3956

Authors: 
Basel Shbita
Craig A. Knoblock
Yao-Yi Chiang
Weiwei Duan
Johannes H. Uhl
Stefan Leyk

Responsible editor: 
Guest Editors ESWC 2020

Submission type: 
Full Paper

Abstract: 
Historical maps provide rich information for researchers in many areas, including the social and natural sciences. These maps contain detailed documentation of a wide variety of natural and human-made features and their changes over time, such as changes in transportation networks or the decline of wetlands or forest areas. Analyzing changes over time in such maps can be labor-intensive for a scientist, even after the geographic features have been digitized and converted to a vector format. Knowledge Graphs (KGs) are the appropriate representations to store and link such data and support semantic and temporal querying to facilitate change analysis. KGs combine expressivity, interoperability, and standardization in the Semantic Web stack, thus providing a strong foundation for querying and analysis. In this paper, we present an automatic unsupervised approach to convert vector geographic features extracted from multiple historical maps into contextualized spatio-temporal KGs. The resulting graphs can be easily queried and visualized to understand the changes in different regions over time. We evaluate our technique on railroad networks and wetland areas extracted from the United States Geological Survey (USGS) historical topographic maps for several regions over multiple map sheets and editions. We also demonstrate how the automatically constructed linked data (i.e., KGs) enable effective querying and visualization of changes over different points in time.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Alessandro Adamou submitted on 12/Jun/2021
Suggestion:
Minor Revision
Review Comment:

The paper describes a method to generate RDF datasets for historical maps, whose topological features change over time. The method was evaluated over a selection of US historical map data.

The paper expands upon work previously carried out and published by the authors, with the addition of further details on the process and updates that reflect the feedback received on previous publications. The novel content is significant enough to justify this paper.

Particularly of note is that the work is getting closer to tackling the problem of present-day maps, which have the potential to change at a much higher rate, with the transition to OpenStreetMap and scale considerations on the incremental potential of the approach. The considerations made on supporting the Linked Data principles in the generation of spatial data are also noteworthy, and the idea of constructing the URI out of hashing geometry and temporal extent data is sound; however, please clarify whether the geometry contains at least some absolute coordinates, so as to rule out the possibility of hash clashes.
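For concreteness, a minimal sketch of this style of URI minting (the namespace, function name, and coordinates here are hypothetical, not taken from the paper's code): if the hashed WKT carries absolute coordinates, identically shaped features at different locations yield different digests, so only a hash collision proper could cause a clash.

import hashlib

# Hypothetical sketch of hash-based URI minting, not the authors' code.
# Hashing the WKT (with absolute coordinates) together with the temporal
# extent makes the URI deterministic and location-specific.
def mint_feature_uri(wkt: str, year: str,
                     base: str = "https://example.org/resource/") -> str:
    digest = hashlib.sha256(f"{wkt}|{year}".encode("utf-8")).hexdigest()
    return base + digest

# Same shape, different absolute coordinates -> different URIs.
print(mint_feature_uri("LINESTRING (-118.28 34.02, -118.27 34.03)", "1958"))
print(mint_feature_uri("LINESTRING (-117.28 33.02, -117.27 33.03)", "1958"))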

The method is described in detail and with clarity. One aspect that I believe needs clarification is a formal definition of a building block: the description indicates the desirable characteristics of a building block, e.g. that it is shared across representations, that it can be a line or a polygon, etc., but while it is rather clear what may qualify as a building block, it is not so clear what it actually is, for example, the largest/longest feature that has certain characteristics. That would help interpret the state in which Algorithm 1 terminates.

The approach was evaluated over a selection of historical maps, with adequate coverage of several aspects. Regarding the evaluation of feature partitioning, the paper considers the impact of the number of vectors on performance, which raises the question of whether generating few nodes from many features is actually a more efficient scenario (so it seems from the maps in Table 1, which might be best aided by a graphical view). It is understandable that performance considerations on entity linking were skipped, as that is primarily up to the reverse-geocoding service. The evaluation of querying is also an important element, but the data at hand generate too few triples for a realistic evaluation of that aspect, which would probably require that a whole county or state be represented in such a way (i.e., consider the impact of large masses of data that do not satisfy the query, a highly likely scenario). The queries themselves do not present a particular performance problem, and it is natural for the subtraction one to take longer.

As a small curiosity: would one consider cases where a feature changes its very type over time? I can think of a large body of water drying out to become representable as a line, or the other way around: a river causing a flood and generating a polygonal feature. Is such a scenario sensible and contemplated? Would it be supported by the building block principles described here?

The authors have chosen GitHub as the host for the entire software suite and the data. The repository content is comprehensive and the process is appropriately documented in the README file; however, I would add the following recommendations:
1. The whole repository seems to be licensed under MIT: as that is mainly a software license, I would advise picking a specific license for the data (e.g., a CC one)
2. Prepare at least one release, or a GitHub package, with the data
3. For better findability, tag the GitHub project with appropriate keywords

Finally, a number of minor corrections to recommend:
* Decide on "building block" or "building-block" as both are used inconsistently throughout (I would recommend the form without the hyphen)
* P4L29: "phenomenon" (singular, like entity)
* P6: the paragraph on "Geometry union-difference" should be bullet point number 3?
* Introduce the OSM acronym before using it, e.g. in P6L49: "OpenStreetMap (OSM)"
* P9L40 "As shown in Figure 10 ..."
* P9L50: "This is important due to the incremental nature" of what?
* P10L19: "Each of which..." which what? Each map, each dataset...?

Review #2
By Sergio Rodriguez Mendez submitted on 23/Jul/2021
Suggestion:
Minor Revision
Review Comment:

* Summary: The article describes an approach to convert vector geographic features extracted from multiple historical maps into contextualized spatio-temporal KGs. The resulting graphs can be easily queried (GeoSPARQL) to understand the changes in different regions over time. The approach and its evaluation focus on linear and polygonal geographic features.

* Overall Evaluation (ranging from 0-100):
[Criterion 1]
[Q]+ Quality: 92
[R]+ Importance/Relevance: 85
[I]+ Impact: 88
[N]+ Novelty: 80
[Criterion 2]
[W]+ Clarity, illustration, and readability: 95
[Criterion 3]
[S]+ Stability: 100
[U]+ Usefulness: 90
[Perspective]
[P]+ Impression score: 88

* Dimensions for research contributions (ranging from 0-100):
(1) Originality (QRN): 86
(2) Significance of the results (ISU): 93
(3) Quality of writing (QWP): 88

* Overall Impression (1,2,3): 89
* Suggested Decision: [Minor Revision]

* General comments:
The work is solid. The paper is mature and well written: easy to follow and understand.
There is novelty in the proposed approach.
The authors can further improve some specific parts of the manuscript by addressing the issues presented in this review.

* Major points for improvements:
{
pag#06 / L#38: "generate a global bounding box for s": how do you calculate it? Please explain the details of this task. Also, a possible improvement would be not to calculate a "box" (rectangle) for the bounding area but rather another geometric object that yields a better approximation (an ellipse?). Please share your thoughts on this. Could this produce better results in line 2 of Algorithm 2? Or is a "box" a limitation of the reverse-geocoding function?
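For reference, one common way to compute such a global bounding box (an assumption for illustration, not a claim about the authors' implementation) is to union the geometries and take the envelope; the tighter approximation suggested above would correspond to something like the convex hull:

from shapely.geometry import LineString
from shapely.ops import unary_union

# Hypothetical feature parts standing in for the set s of Algorithm 2.
parts = [
    LineString([(-118.30, 34.00), (-118.25, 34.02)]),
    LineString([(-118.25, 34.02), (-118.20, 34.05)]),
]

merged = unary_union(parts)
# Axis-aligned global bounding box: (min_x, min_y, max_x, max_y).
print(merged.bounds)
# A tighter region in the spirit of the ellipse suggestion:
print(merged.convex_hull.wkt)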

pag#08 / L#29: "record provenance data" --> it would be better to align the model with PROV-O

pag#11 / L#33: "computation cost is not exponential in practice" --> it would be better to present a proper time complexity analysis of both algorithms.

pag#12 / L#22: "linearly dependent" --> "polynomial depending on..." | the time complexity also depends on tasks/lines 1,2 of Algorithm 2. It would be better to present a proper time complexity analysis.

pag#12 / Table 6: Does the F1 score give the same weight to recall and precision ("P&R")? Or is it "2P&R" or "P&2R"?
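For reference, the standard definition this question turns on: the F1 score is the beta = 1 case of F_beta and weights precision and recall equally, whereas variants such as F0.5 or F2 weight precision or recall more heavily (presumably what "2P&R" and "P&2R" denote):

F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}, \qquad F_1 = \frac{2 P R}{P + R}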

---
pag#13 / L#3: (Fig#15) "dcterms:date" appears two times (for the values 1962 and 2001). Does this mean that a "building-block" (feature part) could have multiple "dcterms:date" attributes? I guess that this would be the case; is this an expected result of Section 2.6? This should be clarified; the fact that the feature parts can have multiple "dcterms:date" values is not intuitive. If so, then it would be better to include the cardinality of each relation in Fig#8 (semantic model).

Fig#17: The query seems wrong. In your main BGP, you have "dcterms:date 1958^^xsd:gYear". Later on, you add "?f dcterms:date ?date", and in the HAVING clause you are performing a "COUNT(DISTINCT ?date)". If all feature parts have only one "dcterms:date", then all selected "?f" will have "dcterms:date 1958^^xsd:gYear", so the COUNT will be 1. Perhaps you meant "COUNT(DISTINCT ?f)"? Again, this is related to the previous comment (and the issue of multiple "dcterms:date" values). This should be clarified; it's not intuitive.

ADDENDUM: I found an example in the data files:
a geo:Feature ;
dcterms:created "2020-03-14T09:55:56.418490"^^xsd:dateTime ;
dcterms:date "1942-01-01T00:00:00"^^xsd:dateTime,
"1950-01-01T00:00:00"^^xsd:dateTime,
"1965-01-01T00:00:00"^^xsd:dateTime ;
geo:hasGeometry ;
geo:sfContains ;
geo:sfWithin ,
.
So, yes, the geographic features can have multiple dates. But it is better to clarify this in the manuscript (see the comments above).
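For concreteness, a minimal rdflib sketch (the feature URI is made up; the dates are those from the addendum above) showing why COUNT(DISTINCT ?date) can exceed 1 once features carry multiple dcterms:date values:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, XSD

g = Graph()
feature = Namespace("https://example.org/resource/")["feature1"]
for year in ("1942", "1950", "1965"):
    g.add((feature, DCTERMS.date, Literal(year, datatype=XSD.gYear)))

# Aggregate over the multi-valued dcterms:date, as in Fig#17.
query = """
SELECT ?f (COUNT(DISTINCT ?date) AS ?n)
WHERE { ?f <http://purl.org/dc/terms/date> ?date }
GROUP BY ?f
"""
for row in g.query(query):
    print(row.f, row.n)  # -> https://example.org/resource/feature1 3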

---
"unsupervised": the authors used the word "unsupervised" 7 times in different parts of the paper (abstract, sec#1, sec#1.2, sec#2.2, sec#4, sec#5). The main usage is as in "automatic unsupervised approach/pipeline". Some examples:
(a) (pag#12 / L#39): Why does the linking task have "unsupervised characteristics"? This is not clear at all.
(b) (pag#15 / L#22): "present an unsupervised pipeline"; why "unsupervised"?

Why are the authors using the word "unsupervised"? The presented pipeline is a set of non-learning algorithms and tasks. Please clarify this point or drop the word.
}

* About the data files and related software artifacts: (“Long-term stable URL for resources”)
(1) "data files are well organized and contain a README file which makes it easy to assess the data": [YES], although there is no specific description about the data files in the README file of the repo.
(2) "the provided resources appear to be complete for replication of experiments": [YES. COMPLETE]
(3) "the chosen repository is appropriate for long-term repository discoverability": [YES. GitHub]
(4) "the provided data artifacts are complete": [YES]

* Minor corrections:
pag#05: L#14 "Data" --> "Input"; L#17 "Result" --> "Output"
pag#06 / L#45: Capitalize the "R" in the section title.
pag#09 / L#08: What is the meaning of the symbols below the column labels in Fig#9? IDs vs. literals? Please, clarify or remove them.
pag#14 / Fig#18, Fig#19, Fig#20: ", marked in" --> "; marked in" or "are marked in"
pag#15 / L#11: "exposed as the data as" --> "exposed the data as"
[everywhere]: "web" --> *Web*, "semantic web" --> "Semantic Web"

Review #3
By Mohamed Sherif submitted on 12/Aug/2021
Suggestion:
Major Revision
Review Comment:

Building Spatio-Temporal Knowledge Graphs from Vectorized Topographic Historical Maps

The paper proposes an unsupervised approach to convert vector geographic features extracted from multiple historical maps into an RDF knowledge graph. The proposed algorithm has three basic steps: geometry partitioning, linking, and knowledge graph generation. The authors evaluated the proposed technique on railroad networks and wetland areas extracted from several editions of the United States Geological Survey (USGS) historical topographic maps.

(1) originality
The idea was investigated by the same authors before in conference and workshop papers [21, 22]. The current paper extends the authors' previous work to support polygon-based geometries and provides a more detailed evaluation.

(2) significance of the results
The evaluations provided by the authors suggest that the proposed approach is feasible and effective in terms of processing time, completeness, and robustness. Still, the proposed approach has some limitations, which are highlighted by the authors in the last section of the paper.

(3) quality of writing
The writing of the paper is generally excellent; I point out a couple of typos in my detailed comments.

(4) availability
The one thing still missing in this work is a SPARQL endpoint to serve the generated knowledge base, as well as W3C-standard-compliant IRI dereferencing (maybe via https://github.com/LodLive/LodView). I would also suggest adding a knowledge graph characteristics table that contains the links to the GitHub project (if any) as well as the endpoint location of the knowledge graph, example resources, and some statistics such as the number of triples, classes, entities, etc.

Next, I give more detailed comments about each section of the paper:

Section 2:
- Sections 2.1 and 2.3 share a large portion of text describing Figs. 4 and 5
- Page 7, lines 41, 47, and 57 run over the page's right margin

Evaluation:
- In Section 3.2, it is not clear to me how the authors compute the precision, recall, and F-measure without a gold-standard link mapping. Were all or part of the generated links labeled by humans?
- In Section 3.2, the part where you use the "human label" is not clear to me.
- Use listings instead of figures for SPARQL queries such as the ones in Figs. 10 and 15-17

Related work
- Page 15, line 3 (right): Some related work that could be used for linking “unlabeled geospatial entities (containing only geometry data, without any additional attributes)” include [a] and [b].

[a] M. A. Sherif, K. Dreßler, P. Smeros and A.-C. Ngonga Ngomo. "RADON - Rapid Discovery of Topological Relations." In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.

[b] M. A. Sherif and A.-C. Ngonga Ngomo. "A systematic survey of point set distance measures for link discovery." Semantic Web 9(5) (2018): 589-604.

References:
- Ref. [10] has no title
- Ref. [27] “Dbpedia” → “DBpedia”