Review Comment:
The manuscript describes a knowledge graph with information about a corpus of geographical works of the past, specifically from the medieval and renaissance period.
The authors appropriately motivate and describe this resource, created in the context of a dedicated project.
This is, in my opinion, a valuable contribution to the journal if considered in the "Dataset Descriptions" category (see https://www.semantic-web-journal.net/reviewers#types).
The paper was submitted in the "Full Paper" category, though.
It must thus be evaluated as a research contribution.
Anyway, in order to reduce the back and forth, I split the description of the main issues below into two sections.
While the first section addresses the relevance as a research contribution, the second, longer one addresses aspects that remain relevant even if the category changes.
# Issues as a Research Contribution
The main contribution to the field, according to the authors, is the creation and publication of the KG, which in my view does not amount to a research contribution in itself.
Specifically, a research contribution should clearly state one or more research questions and describe an experiment that can address some aspect of those.
One way forward, if the authors want to pursue this type of contribution, would be to focus on those aspects of the pipeline/methodology that they consider novel and of general interest.
They should explain why this is the case and evaluate those aspects in multiple use cases.
That means more experimentation and an extensive rewriting of the paper.
# General Issues
## 1. Paper Organisation
The section names "Ontology Population" and "Knowledge Graph Creation" are in my opinion misleading: the first one is about the tool used to generate the instances, not the terminology; the second one describes technical details of the implementation and validation.
I would suggest having only a "Knowledge Graph Population" section, reorganised (see 4.) and possibly containing dedicated subsections (e.g., "Data Entry Tool", "Implementation", "Validation").
The section name "Data Analysis" is very generic. Maybe it should be called "Evaluation", since that is the purpose of the section.
The GUI proposed to access the information in the KG, mentioned in the abstract, Introduction, Discussion, and Conclusion, is only described in the Discussion.
This is an odd choice: one would expect the description of the GUI to be placed before the evaluation ("Data Analysis" section).
## 2. Ontology Design
The ontology design process is described and appears sound.
Nevertheless, as there are many established ontology design methodologies, it would be good to mention whether a specific existing one was adopted or, if not, the reasons for proceeding otherwise.
Furthermore, the authors do not refer to any documentation of intermediate results (minutes/details of the interviews, scenarios, competency questions).
## 3. Ontology/KG Description
### 3.1 Missing Diagrams
The description of the ontology itself is quite minimal and devoid of diagrams.
This is partly justified by the reuse of existing models.
Nevertheless, it would be helpful to see visually the main classes and how they are related (focusing on the main relations and the terms that are used when populating the knowledge graph).
### 3.2 Ontology/KG Structure
There is no description of the modular organisation of the ontology, while the OWL code shows that the main ontology module imports two other ontologies:
- , an OWL representation of LRMoo, including also CIDOC CRM;
- a geographical thesaurus.
Similarly, the KG available at the SPARQL endpoint is organised in three named graphs, but the manuscript neither mentions nor explains this organisation.
The three named graphs are the following ones:
- , with 127,899 triples;
- , with 6,430 triples;
- , with 7,932 triples.
The total number of triples is consistent with the number given in the paper (142,047), so this organisation is presumably not new.
Based on the name, the third one comes probably from the Mapping Manuscript Migrations (MMM) project.
The manuscript mentions that the MMM dataset *can* be integrated.
Has (a part of) it already been integrated? If so, how exactly?
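For reference, the per-graph triple counts listed above can be obtained with a query of the following kind. This is only a minimal sketch in Python (standard library only): the endpoint URL is a placeholder rather than the project's actual address, and the function name `build_count_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Counts the triples contained in each named graph of the dataset.
COUNT_QUERY = """
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
ORDER BY DESC(?triples)
"""

def build_count_request(endpoint: str) -> Request:
    """Build (without sending) a SPARQL Protocol GET request for COUNT_QUERY."""
    url = endpoint + "?" + urlencode({"query": COUNT_QUERY})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

# Placeholder endpoint, NOT the real one. Passing the request to
# urllib.request.urlopen() would actually run the query.
request = build_count_request("https://example.org/sparql")
```

Documenting a query like this in the paper would let readers verify the named-graph organisation themselves.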
### 3.3 Incoherencies between Description and Implementation
Many classes and properties listed in Tables 1 and 2, respectively, are not defined in the ontology.
Precisely, all the terms stated as equivalent to existing ones (in CIDOC CRM or FRBRoo/LRMoo) are neither defined in the ontology nor used in the KG.
For those cases, the KG directly adopts the original terms from CIDOC CRM or FRBRoo/LRMoo.
While this choice is totally understandable and welcome for the purpose of favouring interoperability, the description in the paper is highly misleading.
The authors should just state the extensions they made to the existing ontologies and then describe (again, diagrams would help) how the data have been modeled: i.e., using terms defined by them as well as some terms from existing ontologies.
In addition to what has already been mentioned, Table 2 has two issues:
- property names are not shown, only domain and range;
- the stated equivalencies (which, again, are not actually represented in the ontology) are formally incorrect because they are between properties that have different domains/ranges.
From the Mapping Manuscript Migrations Metadata Schema (http://ldf.fi/schema/mmm/) a single class (Source) and a single property (data_provider_url) are used in the KG, for MMM. This is not documented.
### 3.4 Evolution/Maintenance
The manuscript does not specify if there is any planned method to update the ontology and the KG.
Specifically, while the code of the software tools is published in public repositories (on GitHub), the authors do not mention if there is a repository to track the evolution of the ontology.
### 3.5 Reporting Guidelines
Finally, it would be good to explicitly refer to existing best practices and guidelines for the documentation of ontologies and KGs.
One example, for the ontology, is the MIRO guidelines [1], which are quite detailed and broadly adopted.
## 4. Knowledge Graph Population
### 4.1 Description and Motivation of the Adopted Population Process
Before going into the technical details, the process of KG population should be described in more general terms.
It is unclear what the input of the process is and how the experts use the annotation tool.
How has the corpus of manuscripts been selected?
Do the experts analyse each manuscript and then fill in the fields in the application?
Is any pre-existing knowledge (metadata already associated with a manuscript) used?
Is any automatic annotation system adopted, even if just for suggestions? If not, why?
### 4.2 Validation
The second paragraph of "Knowledge Graph Creation" mentions the validation of the KG using a dedicated tool (Openllet).
The authors assert that the KG successfully passed four validation tasks:
(i) logical consistency;
(ii) correspondence between "the class hierarchy" and "the structure defined by the IMAGO ontology";
(iii) data integrity;
(iv) ability to support complex SPARQL queries.
Apart from the first one, which is straightforward, the other tasks require a more detailed description.
What is the meaning of task (ii)? Are those two (the class hierarchy in the KG and the one in the IMAGO ontology) not the same thing?
How is data integrity validated in task (iii)?
What exactly does task (iv) do? Does it execute a set of predefined queries? Are they the same as those shown in the paper? Does it generate queries somehow? How are the results checked?
Specifically, with respect to data integrity, ontological formalisation alone is of little use for imposing constraints on a dataset.
For that purpose, shape languages such as SHACL are often employed.
Have you considered implementing shapes-based validation?
Finally, as for other aspects of the design/implementation process (see 4.1), it would be good if the authors shared the associated data, in this case the configuration and output of the validation tool.
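To make the suggestion about shapes-based validation concrete, the following sketch shows what a single SHACL shape and its execution might look like. It assumes the third-party packages `rdflib` and `pyshacl` are installed; the shape name, the choice of `has_reprint_date` as the constrained property, and its placement in the `https://imagoarchive.it/ontology/` namespace are illustrative assumptions and not taken from the manuscript.

```python
# Hypothetical shape: every subject that has a reprint date must have at most
# one, typed as xsd:date. Names and namespace are illustrative assumptions.
SHAPES_TTL = """
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix imago: <https://imagoarchive.it/ontology/> .

imago:ReprintDateShape
    a sh:NodeShape ;
    sh:targetSubjectsOf imago:has_reprint_date ;
    sh:property [
        sh:path imago:has_reprint_date ;
        sh:datatype xsd:date ;
        sh:maxCount 1 ;
    ] .
"""

def validate_turtle(data_ttl: str):
    """Validate Turtle data against SHAPES_TTL; returns (conforms, report_text)."""
    # Imported lazily so the shape can be inspected without the packages installed.
    from rdflib import Graph
    from pyshacl import validate

    data = Graph().parse(data=data_ttl, format="turtle")
    shapes = Graph().parse(data=SHAPES_TTL, format="turtle")
    conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
    return conforms, report_text
```

With such a shape, a value stored as a plain xsd:string would be reported as a violation, which is the kind of constraint that OWL consistency checking alone does not enforce.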
## 5. Ontology/KG Availability
Both the ontology and the KG are publicly available online. Nevertheless, there are some issues that should be addressed.
### 5.1 Long-Term Persistence of URLs
The ontology modules, the SPARQL endpoint, and the individuals of the KG all use a project-related namespace (https://imagoarchive.it/).
URLs of this kind are at risk of breaking if the project stops being maintained or some organisational change occurs.
The authors should consider using w3id (https://w3id.org/) or similar redirection services to decouple the namespaces adopted for URLs/URIs from the servers currently holding the implementation/data.
### 5.2 Long-Term Persistence of Datasets
The ontology and the KG are only available on the aforementioned project-related host.
It is highly advisable to upload snapshots of those resources to public repositories such as Zenodo or Figshare, especially considering that neither the ontology (see 3.4) nor the KG can be regenerated if the project host becomes unavailable.
Furthermore, using persistent identifiers for specific versions of the resources makes it possible to track their history and to associate the paper with specific versions.
### 5.3 Availability of KG Dumps
Currently there is no way to directly download the full KG dataset as a dump (it can currently be done with a few CONSTRUCT queries on the SPARQL endpoint, but that would become trickier if the size increases).
The use of public repositories to store the dumps, as recommended in 5.2, would also address this issue.
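As an illustration of the CONSTRUCT-based workaround mentioned above, the sketch below generates paged queries for one named graph. The graph URI is a placeholder, the function name `paged_dump_queries` and the page size are arbitrary assumptions, and 127,899 is simply the size of the largest graph reported in 3.2.

```python
from math import ceil

def paged_dump_queries(graph_uri: str, total_triples: int, page_size: int = 50_000):
    """Yield CONSTRUCT queries that together cover one named graph."""
    for page in range(ceil(total_triples / page_size)):
        yield (
            "CONSTRUCT { ?s ?p ?o }\n"
            f"WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}\n"
            "ORDER BY ?s ?p ?o\n"  # a fixed order keeps the pages disjoint
            f"LIMIT {page_size} OFFSET {page * page_size}"
        )

# Placeholder graph URI, not one of the actual named graphs.
queries = list(paged_dump_queries("https://example.org/graph/archive", 127_899))
```

The need for ordering and paging, and the endpoint's result limits, are among the reasons why a published dump is preferable once the KG grows.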
### 5.4 URI Dereferenceability
It is good practice to use dereferenceable URIs, employing content negotiation to respond with either a human-tailored description of the resource (a web page) or a machine-readable description (some RDF serialization).
The URIs used in this KG are instead not dereferenceable, neither those of individuals (e.g., https://imagoarchive.it/ontology/resources/manifestation/manuscript/mm-626) nor those of ontology terms (e.g., https://imagoarchive.it/ontology/has_curator).
In the case of the ontology modules, the URI of the ontology as a whole is dereferenceable (e.g., https://imagoarchive.it/ontology/) but only as a machine-readable resource (in RDF Turtle).
To get the human-readable documentation of the main ontology a different URL must be used (https://imagoarchive.it/doc/index-en.html).
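The content negotiation described in this section could, for instance, be driven by logic like the following. This is a simplified sketch, not a description of any existing server: the set of available representations is an assumption, and partial wildcards such as `text/*` are deliberately ignored.

```python
# Representations assumed to exist for every resource (hypothetical).
AVAILABLE = ("text/html", "text/turtle")

def parse_accept(header: str) -> list[tuple[str, float]]:
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)

def negotiate(accept_header: str, default: str = "text/html") -> str:
    """Pick the representation to serve for a given Accept header."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in AVAILABLE:
            return media_type
        if media_type == "*/*":
            return default
    return default
```

A browser sending `text/html,application/xhtml+xml,*/*;q=0.8` would get the web page, while a client sending `text/turtle` would get the RDF description of the same URI.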
## 6. Ontology/KG Implementation
In the ontology modules and in the KG, multiple incompatible namespaces are used for both CIDOC CRM and FRBRoo.
Namespaces for CIDOC CRM:
- , used in ontology modules;
- , used in the KG, for archive and toponyms;
- , used in the KG, for MMM.
Namespaces for FRBRoo:
- , used in ontology modules and in the KG, for archive and toponyms;
- , used in the KG, for MMM.
The datatype property has_reprint_date has xsd:string as its range. Is there any reason to prefer xsd:string to xsd:date or xsd:dateTime?
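To illustrate the datatype question, the sketch below classifies literal values according to the most specific XSD datatype they could take. The function name `suggest_datatype` and the sample values are hypothetical, and the check is simplified (no time zones, negative years, or years with more than four digits).

```python
import re
from datetime import date

def suggest_datatype(value: str) -> str:
    """Suggest xsd:date, xsd:gYear, or xsd:string for a literal value."""
    value = value.strip()
    match = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", value)
    if match:
        try:
            # Constructing the date rejects impossible days such as 30 February.
            date(*map(int, match.groups()))
            return "xsd:date"
        except ValueError:
            return "xsd:string"
    if re.fullmatch(r"\d{4}", value):
        return "xsd:gYear"
    # Free-text or uncertain dates can only be kept as strings.
    return "xsd:string"
```

If most values turn out to be full dates or plain years, typed literals would allow range queries and sorting in SPARQL; if many are uncertain or free-text, that would be a reason for keeping xsd:string worth stating in the paper.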
For the entity representing the whole ontology in https://imagoarchive.it/ontology/ (the one with rdf:type owl:Ontology) a blank node is used, instead of a URI.
The thesaurus included along with the ontology (https://imagoarchive.it/Thes) has labels only in Italian and no definitions (using rdfs:comment or similar properties).
These limitations may hinder the reusability of the dataset, especially by people who do not understand Italian.
Even Italian-speaking experts may not be able to guess the precise meaning of a topic if it has no accompanying definition.
## 7. Evaluation
The evaluation of the KG is based on its ability to perform six types of "knowledge extraction targets" with corresponding SPARQL queries.
Albeit not explicitly called that way, these play the role of competency questions (CQs) in designing and evaluating the ontology/KG.
While CQs are often adopted as a means to evaluate ontologies and KGs, there should be other forms of evaluation, involving users/experts not involved in the design process and possibly multiple usage contexts.
The authors mention that they are doing a user-based evaluation of the GUI. That could be included as a form of "in-use evaluation" of the ontology/KG (albeit mediated by that specific UI).
Furthermore, it would be good to discuss the role of the described queries in the context of higher-level tasks performed by the experts (e.g., researching a topic) and draw comparisons with existing query methods and repositories.
# References
[1] Matentzoglu, Nicolas, et al. "MIRO: guidelines for minimum information for the reporting of an ontology." Journal of biomedical semantics 9.1 (2018): 6.