Review Comment:
** Summary of the paper:
This paper deals with a key issue in ontology and knowledge graph (KG) management: the evaluation of their evolution. This evaluation is important to ensure that the ontology remains consistent over time, and relevant even after updates or maintenance actions. The authors propose several metrics that characterize a KG and help evaluate its dynamics, i.e. determine whether the changes made to the graph have a positive or a negative impact on its quality.
The paper starts with a review of related works, focusing on existing metrics that measure graph quality and allow an automatic evaluation of that quality after changes. The authors identify several kinds of metrics, bearing either on syntax, semantics or pragmatics. This review motivates their research because it underlines that most quality evaluation metrics and methods reflect the current status of the ontology without considering its changes and evolution.
They also acknowledge that most available tools for automatic ontology quality evaluation (like OntoMetric, DOORS and OQuaRE) perform a static evaluation of the KG or ontology quality, and do not account for the evaluation of changes in these models.
They conclude that there is a need for a new tool that would gather quality and evolution metrics from various methods with a dynamic perspective, making it possible to automatically measure whether changes in the ontology or KG had a positive or a negative impact on its quality. Such metrics and tools could improve the management of ontology evolution and support the detection of changes that introduce errors or inconsistencies in the ontology.
The paper then presents its two main contributions. The first is two sets of metrics (syntactic and semantic) for assessing ontology quality, adapted from the work by Mc Daniel et al., 2018. The syntactic features evaluate the formal quality of the graph, such as its syntactic richness and its structure. Their adaptation aims at normalizing the feature values and at integrating the graph's compliance with rules, its richness and its hierarchical structure. The semantic features are modified to evaluate semantic consistency, i.e. how closely the model reflects a domain, by assessing its similarity with a reference text or resource. Further features evaluate the interpretability of the model (through the number of labels and definitions) and a semantic F1 score.
The second contribution is the experimental validation of the metrics. The evaluation setup tests the metrics on knowledge graphs derived from four sources: a small example taken from the ESCO ontology, ESCO itself, MeSH and the Pizza Ontology. Three sub-graphs are extracted from ESCO and three from MeSH. Two experiments are performed to evaluate two hypotheses and thus test the relevance of the syntactic and semantic metrics: H1 assumes that removing information from the KG decreases the syntactic scores; H2 expects the semantic scores to decrease after randomly changing the labels of leaf nodes in the graph hierarchy. For the four graphs, the reference resource used to evaluate semantic consistency is Wikipedia. The similarity between graph labels and Wikipedia pages is measured using BERT, a generic encoder model.
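For what it is worth, this semantic comparison could be implemented roughly as follows (a minimal sketch using a generic sentence encoder; the model name, the example labels and the way reference texts are selected are my assumptions, not necessarily the authors' setup):

```python
# Sketch: cosine similarity between KG labels and reference texts (e.g. Wikipedia abstracts).
# The encoder, labels and reference texts below are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any generic BERT-style encoder

kg_labels = ["data analyst", "machine learning", "statistics"]
reference_texts = [
    "A data analyst inspects, cleans and models data to support decisions.",
    "Machine learning is a field of study concerned with learning from data.",
    "Statistics is the discipline that collects and analyses numerical data.",
]

label_emb = model.encode(kg_labels, convert_to_tensor=True)
ref_emb = model.encode(reference_texts, convert_to_tensor=True)

# For each label, keep the best-matching reference text; averaging these scores
# gives a crude estimate of how well the KG labels are covered by the resource.
similarities = util.cos_sim(label_emb, ref_emb)     # |labels| x |references|
best_per_label = similarities.max(dim=1).values
print("semantic consistency estimate:", best_per_label.mean().item())
```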
For each tested graph, three values are compared: the ground-truth values of the metrics, calculated on the graph before any change is made; the H1 values, calculated after removing labels; and the H2 values, measured after randomly changing the labels of leaf nodes.
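For concreteness, the two perturbations could be reproduced along the following lines (an illustrative sketch with rdflib; the file name, the deletion ratio and the definition of a "leaf" as a class with no subclasses are my assumptions):

```python
# Illustrative sketch of the two perturbations: H1 deletes rdfs:label triples,
# H2 randomly permutes the labels of leaf classes (classes with no subclasses).
import random
from rdflib import Graph
from rdflib.namespace import RDFS


def remove_labels(graph: Graph, ratio: float = 0.5) -> None:
    """H1: delete a fraction of the rdfs:label triples."""
    labels = list(graph.triples((None, RDFS.label, None)))
    for triple in random.sample(labels, k=int(len(labels) * ratio)):
        graph.remove(triple)


def shuffle_leaf_labels(graph: Graph) -> None:
    """H2: permute the labels of classes that have no subclasses."""
    parents = {o for _, _, o in graph.triples((None, RDFS.subClassOf, None))}
    children = {s for s, _, _ in graph.triples((None, RDFS.subClassOf, None))}
    leaves = children - parents
    leaf_labels = [(s, o) for s, _, o in graph.triples((None, RDFS.label, None)) if s in leaves]
    shuffled = random.sample([label for _, label in leaf_labels], k=len(leaf_labels))
    for (subject, old_label), new_label in zip(leaf_labels, shuffled):
        graph.remove((subject, RDFS.label, old_label))
        graph.add((subject, RDFS.label, new_label))


# Each perturbation is applied to its own copy of the original graph,
# and the metrics are then recomputed on the perturbed copy.
original = Graph().parse("kg.ttl")  # hypothetical input file
```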
Results are rather similar for the four graphs, except for the semantic metric on the MeSH graph. The experiments clearly confirm hypothesis H1, whereas H2 is harder to confirm. Surprisingly, the semantic quality is better after changing labels in MeSH. This result suggests that specialized graphs would require a specialized reference source rather than Wikipedia, and a specialized language model (like Med-BERT) rather than a generic one like BERT.
The paper concludes that specific measures are required to evaluate the quality of knowledge graphs during their evolution. The results confirm that removing information reduces syntactic quality, whereas changing labels impacts the semantic quality differently depending on the degree of specialization of the graph. Future work includes using specialized reference resources and domain-specific language models for domain-specific knowledge graphs.
** Strengths of the paper:
. deals with an important issue in KG engineering: designing a support tool to maintain KG quality when the graph changes
. this issue is not much studied in the state of the art, which makes the paper quite original.
. very large coverage of the state of the art
. overall clear and well presented, although some parts lack clarity.
. solid experimental setup to evaluate two clearly stated hypotheses
. the proposed metrics are original contributions that validate one of the hypotheses
** Limitations of the paper:
. State of the art
1) The state of the art reviews a very large number of works. It is fairly well structured, but each paragraph is very short and contains a high-level enumeration of features without any definition. Each paragraph reflects a different perspective that is hard to compare with the others. In a journal article, one would expect a more detailed analysis of existing work and definitions of quality characteristics. In addition, the state of the art should be better structured, with the paragraphs better articulated with one another and elements given for comparison.
2) It is a pity that you do not present more precisely the metrics and features used in previous works to assess the quality of a KG or an ontology. Such a presentation would help to understand why you selected the criteria that you did.
. From the current presentation, it is hard to understand why you decided to refer to DOORS and the criteria proposed by (Mc Daniel et al.)
1) The motivations for choosing these criteria rather than others should be explicitly reported.
2) Most criteria should be better defined, and at least the criteria used in (Mc Daniel et al.) should be presented in detail, so as to better understand whether the way you adapt and reuse them is relevant, and the nature of their adaptation.
3) You explain how you adapt them in section 3, but this adaptation is not really well justified. You motivate your choices very briefly, and it is hard to understand why these criteria and your adaptations are actually better than other criteria for evaluating KG quality when the KG changes.
. Many questions arise about the definitions of the metrics and their calculation formulae:
1) The criterion of syntactic lawfulness requires rules and takes into account the number of "breached rules": what are these rules? Where do they come from? I may have missed it, but you never mention rules as one of the components of a KG. Are these rules constraints that must be satisfied by the data in the KG? Are they external to the KG? It is all the more confusing that you do not use this criterion in your experiments because your KGs have no rules. So when is this criterion applicable? What is a "breached rule"? When do you consider that a rule is breached?
This is a major limitation in the paper that requires a major revision.
2) The criterion of syntactic structures relies on the difference between sub-class and class, which is not an obvious one. All sub-classes are classes by definition, and in an OWL ontology, all classes are sub-classes of owl:Thing (a small sketch illustrating this ambiguity follows this list of questions). So it would be worth explaining when you consider a concept to be a class or a sub-class. Moreover, you need to give the intuition behind this criterion and the formula that you propose to calculate it. As it stands, your definition seems arbitrary.
3) The criterion of semantic F1 is not clear: what do you mean by "definitions that also occur in the KG"? Should the definitions be exactly the same in the graph and in the resource? Or is the presence of a definition enough, whatever its content?
4) In the conclusion, you claim that you introduced "a comprehensive set of syntactic and semantic metrics". This claim is somewhat stronger than the reality: how can you assert that this set is comprehensive? What clues or experiments can show that it is?
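To make the class/sub-class ambiguity raised in point 2) above concrete, here is a minimal sketch (using owlready2, with invented class names and IRI) showing that every declared OWL class is also a sub-class, at least of owl:Thing:

```python
# Minimal sketch for point 2): in OWL every named class is also a sub-class
# (of owl:Thing if nothing else), so "class" vs "sub-class" needs a precise definition.
from owlready2 import get_ontology, Thing

onto = get_ontology("http://example.org/demo.owl")  # invented IRI

with onto:
    class Pizza(Thing):
        pass

    class MargheritaPizza(Pizza):
        pass

print(issubclass(Pizza, Thing))            # True: even a "top-level" class is a sub-class
print(issubclass(MargheritaPizza, Pizza))  # True
print(Pizza.ancestors())                   # includes owl:Thing
```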
. Many questions arise about the experimental setup. The ones that bother me most are the following:
1) Why did you choose to consider changes to the labels, and only to the labels? It is true that deleting labels is one way of degrading the quality of a KG, but deleting concepts or properties could be an alternative: why did you not consider such changes? It would be a good idea to add a section explaining what types of changes you considered, and why you decided to focus on removing labels.
For instance, given the definition of the syntactic richness criterion, I would expect one of the experiments to consist in deleting attributes. Why did you not do so? Whatever the reason, it would be helpful to better justify your choice.
2) It is surprising to consider that semantics can be evaluated using labels, when one of the motivations for building ontologies and knowledge graphs is to go beyond language and represent "knowledge" using richer representations such as concepts, properties and attributes. Of course, going from labels to BERT vectors is a step towards conceptualization. You need to explain and comment on this.
3) Why did you decide that the external resource should be the same for all KGs? Given that one of your experiments involves KGs extracted from MeSH, it seems "obvious" that Wikipedia is too general and might be missing some MeSH concepts or terms. Again, a discussion of what a "good external resource" is for the purposes of the semantic evaluation that you propose would be very helpful. Have you considered other resources such as KGs (Wikidata, YAGO or DBpedia), domain-specific terminologies, or domain-specific documents?
4) It is not clear why you chose the Pizza ontology. This ontology has been modified several times, giving rise to various versions, and N. Fridman Noy has published several articles on evolution issues using this ontology as a test set. It would have been interesting to test your measures on two or more versions of this ontology. As it stands, your motivation for testing your approach with this ontology needs to be more detailed.
. Code and data availability: I could not find a link to the code and the datasets used for the experiments. Please add this information.
Given all these unanswered questions, the paper needs to be thoroughly rewritten before it can be accepted. I suggest that the authors revise it significantly, possibly improve their experiments, and submit the new document at a later date.
** Details:
Section 3.1: two sentences are repeated with minor modifications: "While existing ... Table 1" is almost the same as "The metrics are repeated ... Table 1".
Section 4: "This work expanded ... and introduced" -> "This work expands ... and introduces ..."
In hypothesis 2 (section 4.5), it is odd, or even meaningless, to talk about leaves in a graph. You should either explain what you mean by a leaf in a graph, or adapt your vocabulary and use another word. In both cases, a definition is required.
In section 5, you call Ground Truth the values obtained with the KG before it is changed. This is an unusual way to use "ground truth"; it deserves a comment and a clear definition in section 5.1. You also use GT in Table 5 without explaining that GT stands for Ground Truth there.
Reference 26: replace "et al." with the full list of authors.
References 35 and 36 are the same.
References 83 and 84 are the same.