Dynamic knowledge graph evaluation

Tracking #: 3739-4953

Authors: 
Roos Bakker
Maaike de Boer

Responsible editor: 
Marta Sabou

Submission type: 
Full Paper
Abstract: 
In a world where information is exchanged at an increasing pace, knowledge becomes quickly outdated. Formal constructs that capture human knowledge, such as knowledge graphs and ontologies, need to be updated and evaluated to stay relevant and functioning. However, manually updating and evaluating existing knowledge models is labour intensive and prone to errors. This study addresses the challenge of evaluating changes in existing knowledge graphs. In this work, syntactic and semantic metrics tailored for change evaluation are introduced. The metrics are implemented and tested through experiments on knowledge graphs across various domains. In these experiments, real-world changes are simulated by removing concepts and introducing faulty ones before measuring the quality with the syntactic and semantic metrics. The hypothesis is that such changes decrease scores: removing concepts influences syntactic qualities such as the structure of the model, while adding faulty concepts affects semantic qualities like model consistency. The results confirm the hypothesis, showing that the extent and nature of the changes influence the scores. Additionally, size and degree of specialisation of the graph affect the scores. Overall, this study presents a set of evaluation metrics and provides empirical evidence of their efficacy in assessing modifications to knowledge graphs from different domains.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 28/Mar/2025
Suggestion:
Reject
Review Comment:

# Summary:
This paper introduces an updated set of metrics to evaluate the quality of knowledge graphs.

The validation and experiments are poorly designed and do not test the validity of the metrics. The experiments only touch 2 of the 6 submetrics and are not applied to real changes. If the goal is ontology evolution and maintaining quality as the graphs change, then the authors need to evaluate on real changes and also evaluate all of their metrics, not just the ones that were easy to implement in the experiment by changing labels. The graphs which the authors evaluate also all have only a Tbox and no Abox, and it is difficult to see how this work would translate to real-world KGs, which are much larger in size.

Therefore, I recommend reject and resubmit, after the authors redesign and execute the validation of the proposed metrics.

# Additional Feedback:

## 1. Introduction:
"This example can be viewed in a simplified graph form in Table 1." --> I think this should be Figure 1, not table.

## 2. Related work:
There are several related works that I believe are missing:
- OntoEval (https://dl.acm.org/doi/abs/10.1145/3543873.3587318)
- KGHeartBeat (https://link.springer.com/chapter/10.1007/978-3-031-77847-6_3)
- and, on KG evolution in general: How Does Knowledge Evolve in Open Knowledge Graphs (https://drops.dagstuhl.de/entities/document/10.4230/TGDK.1.1.11)
This last paper will probably lead you to other relevant papers, especially on ontology and KG evolution.

## 3. Evaluation metrics:
It looks like all metrics are built on the work of McDaniel et al. [81], but that is certainly not the only ontology evaluation metric paper you cite in the related work. I have a hard time believing that the measures are purely based on [81] and none other. If that is the case, a clear justification is needed, for both the syntactic and the semantic quality metrics, as to why you are building on this work and not on others.
- The quality metrics only take into account the terminology (Tbox) of a KG. This is by design, but you do not address the issue. KGs often have a large Abox, but these metrics do not take that into account at all.

### Syntactic Quality Metrics:
- You do not introduce the variables which define your measures in table 1. It is very hard to understand the table. You need to introduce each metric in detail in the text and introduce its variables properly, so that it would be possible to implement them from the text alone.
- You use the variable s in both SyL and SyS and they do not mean the same thing. This should not happen when writing mathematical formulas in a paper.
- The formatting of the variables for the weights is not correct in the text (but is correct in the table)
- What is a rule? How can it be breached?

### Semantic Quality Metrics:
- ".., which is inspired previous work [81],.." --> there seems to be a preposition missing here, inspired by? inspried from?
- "The Semantic Consistency can be measured localy or over the complete knowledg graph" --> how is this supposed to be done? You need to explain this in your text.
- The formulas need to be explained in detail in the text. The reader needs to be able to understand the table after reading the text. Right now that is not the case. Having a table with a considerable amount of text doesn't really help in terms of understanding. The text should not be in the table, only the formula.
- "for future automation of the process" --> do you do not actually automate this part of the metrics but use manually gathered keywords. You need to add the details on how you do the gathering. This implies that you are not actually testing and validating this proposed approach, so why are you introducing?

## 4. Validation:
There are multiple problems with the validation, the most obvious being that you are not actually evaluating against the dynamicity of these graphs, or their evolution. You are introducing some minor *label* changes only.
The validation also only focuses on SyR and SeC, not even on the overall metrics, as the remaining metrics remain unchanged due to the validation only addressing the changes to labels.
You introduce some changes but do not provide many details on them. Similarly, it is unclear how large the different KGs are (they are actually mostly terminologies, as none of them have an Abox).

In general, the validation is wholly insufficient to show that the measures which you have introduced bring any value. Not only are the experiments designed to validate only SyR and SeC, you also do not apply your validation to KGs of various sizes and forms (no Abox). The sizes that you investigate are relatively small, and you do not evaluate with real changes, which would have been possible by taking existing evolving KGs (like MeSH).
I will not provide more detailed feedback on Sections 5 and 6 because I don't believe that it is helpful to the authors at this point. The validation needs to be designed in a way that actually tests the validity of what the authors claim the measures do, which in its current form is not the case.

Review #2
Anonymous submitted on 11/May/2025
Suggestion:
Major Revision
Review Comment:

Thank you for the opportunity to review the paper. The authors present an approach to evaluate knowledge graph changes with the objective of preserving the semantic and structural consistency of the graph.

New metrics and measures are introduced to assess the semantic and structural state of knowledge graphs after a change is introduced. Experiments are performed to demonstrate the feasibility of the approach.
The paper addresses a very important and relevant aspect to the semantic web community. There are several aspects in the paper that require attention to make it fit for a journal publication.

At the level of the introduction, it is not clearly written. It is recommended to rewrite the introduction to identify the gap in the literature that leads to the research question the paper aims to answer. In the second paragraph of the introduction, it is mentioned that “Previous work on knowledge graph or ontology evaluation has often relied on manual evaluation methods, the existence of a ground truth, or the usage within an application.” First, references to related works should be added at this level. Second, this statement is not very accurate, as approaches for automatic ontology consistency checking exist, unless you are referring to very specific limitations, which should then be clarified.

When the methodology is introduced in the first line of Page 2, “score decreases” is mentioned; however, it is not clear what score is being referred to. It is recommended, after the gap and research question, to introduce the method, metrics, etc., and how they differ from the existing research in the field.

In the literature review section, Sections 2.1 and 2.2 are very descriptive; it is recommended to be more analytical in fleshing out the gap and lessons learned. One potential way to address this is to create a table at the end of Section 2 listing the related works, what they focus on, and how the paper addresses this gap.

In the proposed approach, some elements in the metrics are not well defined, e.g., “two weights are included (wsyr1, wsyr2)”: what is y? What is the subscript “s”? Also, how are those weights set based on the “aspects” of the knowledge graph that are important? What is meant by aspects?

I suggest rewriting Tables 1 and 2, so that each metric has a clear “calculation”, followed by a well-described text fleshing out each variable. Currently things are mixed up leading to unclarity.

Another question I have about the approach is the reliance on Wikidata as an external validator for semantic interpretability. How feasible is this? E.g., imagine that instead of having “music teacher” as a concept, it is modelled as a relation such as: person > music teacher > Position. While this follows a different design approach, it is still highly semantically interpretable, yet it would probably not match in Wikidata. More details at this level would help.
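To illustrate the concern, here is a minimal sketch of one possible reading of “matching in Wikidata”, using the public wbsearchentities endpoint. This is an assumed interpretation for illustration, not the paper's confirmed procedure, and the compound label in the example is hypothetical:

```python
# Sketch: naive check of whether a KG label has a Wikidata match, interpreting
# "match in Wikidata" as a search hit via the public wbsearchentities endpoint.
# The paper's actual lookup procedure may differ; the compound label below is
# a hypothetical illustration of the reviewer's point.
import requests

def has_wikidata_match(label: str, language: str = "en") -> bool:
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": label,
            "language": language,
            "format": "json",
        },
        timeout=10,
    )
    response.raise_for_status()
    return len(response.json().get("search", [])) > 0

print(has_wikidata_match("music teacher"))                        # plain concept label
print(has_wikidata_match("person holds position music teacher"))  # relation-style phrasing
```

A lookup of this kind would treat relation-style phrasings as unmatched even when they are perfectly interpretable, which is why more details on the matching procedure would help.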

In the evaluation results, one major element that needs to be addressed is that some metrics were not tested, e.g., “Syntactic Lawfulness”. It is not advisable to have, in a methodology, proposed solutions that cannot be tested due, for example, to a lack of available data. There should be a suitable way to measure and evaluate the proposed approach. If an ontology or dataset is not available, one could rely on a manual evaluation or other means of evaluation.

In footnote 1, it is better to include the implementation metrics and keywords in the paper as an appendix or online reference.

In the conclusion, it would be good to better articulate the research contributions.

The paper would benefit from a full and thorough proofreading. Several minor comments are listed below, with probably more that I could not spot.

I hope the comments will help improve the paper.

Minor comments:
- Table 1 should be Figure 1, and its references should be fixed throughout the text.
- Section 2.1: this sentence is grammatically incorrect and not clear: “in which some levels are easier to approach evaluation that others”
- Section 3: “a data-drive approach” --> a data-driven approach
- When sections, tables and figures are referenced in the text, the first letter should be capitalized. E.g. sections 3.1 and 3.2 --> Sections 3.1 and 3.2
- Section 4: “this work expandec” --> this work expanded
- Section 4.3: The headers based on italics are not well formatted. It is suggested to better format them, e.g., ending with a “.”, or whatever the journal recommends.
- Section 4.5: “affect affect” --> affect
- Section 4.6: “applied too all” --> applied to all

Review #3
By Nathalie Aussenac-Gilles submitted on 28/May/2025
Suggestion:
Reject
Review Comment:

** Summary of the paper

This paper deals with a key issue in ontology and knowledge graph (KG) management: the evaluation of their evolution. This evaluation is important to ensure that the ontology remains consistent over time, and relevant even after updates or maintenance actions. The authors propose several metrics that can characterize a KG and contribute to evaluating its dynamics, i.e. to qualify whether the changes brought to this graph have a positive or a negative impact on its quality.

The paper starts with a review of related works, focusing on existing metrics to measure and evaluate graph quality and allow an automatic evaluation of their quality after some changes. They identify several kinds of metrics, bearing either on syntax, semantics or pragmatics. This review has motivated their research because it underlines that most quality evaluation metrics and methods reflect the status of the ontology without considering its changes and evolution.
They also acknowledge that most available tools for automatic ontology quality evaluation (like OntoMetric, DOORS and OQuaRE) perform a static evaluation of the KG or ontology quality, and do not account for the evaluation of changes in these models.

They conclude that there is a need for a new tool that would gather all quality and evolution metrics from various methods with a dynamic perspective, making it possible to automatically measure whether changes in the ontology or KG had a positive or a negative impact on its quality. Such metrics and tools could improve the management of ontology evolution and support the detection of changes that introduce errors or inconsistencies in the ontology.

Then the paper presents the two main contributions: 2 sets of metrics (syntactic and semantic) for assessing ontology quality, adapted from the work by McDaniel et al., 2018. The syntactic features evaluate the formal quality of the graph and its structure, such as the syntactic richness and structure. The adaptation of the syntactic features aims at normalizing their values and at integrating how compliant the graph is with some rules, its richness and its hierarchical structure. The semantic features are modified in order to evaluate the semantic consistency, i.e. how closely the model reflects a domain, by assessing its similarity with a reference text or resource. Other features evaluate the interpretability of the model (through the number of labels and definitions) and a semantic F1 score.

The second part of the contribution is the experimental validation of the metrics. The evaluation setup includes testing the metrics using 4 knowledge graphs: a small example taken from the ESCO ontology, ESCO itself, MeSH and the Pizza Ontology. 3 sub-graphs are extracted from ESCO and 3 from MeSH. 2 experiments are performed to evaluate 2 hypotheses so as to test the relevance of the syntactic and semantic metrics: H1 assumes that removing information from the KG decreases the syntactic scores; H2 expects semantic scores to be impacted (decreased) after randomly changing the labels of leaf nodes in the graph hierarchy. For the 4 graphs, the reference resource to evaluate semantic consistency is Wikipedia. The similarity between graph labels and Wikipedia pages is measured using BERT, a generic encoder model.
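For illustration, here is a minimal sketch of how such a label-to-reference similarity could be computed with a generic encoder. The model name, best-match selection and averaging are assumptions made for illustration; the paper's exact procedure is not given in this review:

```python
# Sketch: similarity between KG labels and a reference text using a generic
# encoder. The model name, best-match selection and averaging are assumptions
# for illustration; the paper's exact procedure is not given in this review.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in for a BERT-style encoder

def label_reference_similarity(labels, reference_passages):
    """Return the mean of each label's best cosine similarity against the reference passages."""
    label_emb = model.encode(labels, convert_to_tensor=True)
    ref_emb = model.encode(reference_passages, convert_to_tensor=True)
    sims = util.cos_sim(label_emb, ref_emb)      # shape: |labels| x |passages|
    return sims.max(dim=1).values.mean().item()  # best match per label, averaged

# Hypothetical example: ESCO-style labels against Wikipedia-derived passages
print(label_reference_similarity(
    ["music teacher", "software developer"],
    ["A music teacher instructs students in playing musical instruments.",
     "A software developer designs, writes and maintains computer programs."],
))
```

Under such a scheme, the choice of reference passages and encoder clearly drives the score, which is relevant to the concerns about specialized graphs discussed below.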

For each tested graph, 3 values are compared: the ground truth values of the metrics are calculated on the graph before any change is made; H1 values are calculated after removing labels, and H2 values are measured after randomly changing the labels of leaf nodes.

Results are rather similar for the 4 graphs, except for the semantic metric on the MeSH graph. The experiments clearly confirm hypothesis H1, whereas H2 is more complex to assert. Surprisingly, the semantic quality is better after changing labels in MeSH. This result suggests that specialized graphs would require a specialized reference source rather than Wikipedia, and a specialized language model (like Med-BERT) rather than a generic one like BERT.

The paper concludes that specific measures are required to evaluate the quality of knowledge graphs during their evolution. Results confirm that removing information reduces syntactic quality, whereas label changes affect semantic quality differently depending on the degree of specialization of the graph. Future work includes using specialized reference resources and domain-specific language models for domain-specific knowledge graphs.

** Strengths of the paper:
. deals with an important issue in KG engineering: designing a support tool to maintain KG quality when it is changed
. this issue is not much studied in the state of the art, which makes the paper quite original.
. very large coverage of the state of the art
. on average clear and well presented, although some parts are not clear enough.
. solid experimental setup to evaluate two clearly stated hypotheses
. the proposed metrics are original contributions that validate one of the hypotheses

** Limitations of the paper

. State of the art
1) The state of the art reviews a very large number of works. It is fairly well structured, but each paragraph is very short and contains a high-level enumeration of features without any definition. Each paragraph reflects a different perspective that hardly compares with the others. In a journal article, one would expect a more detailed analysis of existing work and definitions of quality characteristics. In addition, the state of the art should be better structured, with each paragraph better articulated and indications given for comparison.
2) It is a pity that you do not present more precisely the metrics and features used in previous works to assert the quality of a KG or an ontology. Such a presentation would help to understand why you selected the criteria that you chose.

. From the current presentation, it is hard to understand why you decided to refer to DOORS and the criteria proposed by McDaniel et al.
1) The motivations for choosing these criteria rather than others should be explicitly reported.
2) Most criteria should be better defined, and at least the criteria used in McDaniel et al. should be presented in detail, so as to better understand whether the way you adapt and reuse them is relevant, and the nature of their adaptation.
3) You explain how you adapt them in Section 3, but this adaptation is not really well justified. You motivate your choice very briefly, and it is hard to understand why these criteria and your adaptations are actually better than other criteria for evaluating KG quality when the KG changes.

. Many questions arise about the definitions of the metrics and their calculation formulae;
1) The criterion of syntactic lawfulness requires rules and takes into account the number of "breached rules": what are these rules? Where do they come from? I may have missed it, but you never mention rules as one of the components of a KG. Are these rules constraints that must be satisfied by the data in the KG? Are they external to the KG? It is all the more confusing that you do not use this criterion in your experiments because your KGs have no rules. So when is this criterion applicable? What are "breached rules"? When do you consider that a rule is breached?
This is a major limitation in the paper that requires a major revision.
2) The criterion of syntactic structures relies on the difference between sub-class and class, which is not an obvious one. All sub-classes are classes by definition, and in an OWL ontology, all classes are sub-classes of owl:Thing. So it would be worth explaining when you consider a concept to be a class or a subclass. Moreover, you need to give the intuition behind this criterion and the formula that you propose to calculate it. As such, your definition seems arbitrary.
3) The criterion of Semantic F1 is not clear: what do you mean by "definitions that also occur in the KG"? Should the definitions be exactly the same in the graph and in the resource? Or is the presence of a definition enough, whatever its content?
4) In the conclusion, you claim that you introduced "a comprehensive set of syntactic and semantic metrics". This sentence is a little stronger than the reality: how can you assert that this set is comprehensive? What clues or experiments can prove that it is comprehensive?

. Many questions arise about the experimental setup. The ones that bother me most are the following:
1) Why did you choose to consider changes to the labels, and only to the labels? It is true that deleting labels is one way of degrading the quality of a KG, but deleting concepts or properties could be an alternative solution: why did you not consider such changes? It would be a good idea to add a section explaining what types of changes you considered and why you decided to focus on removing labels.
For instance, given the definition of the syntactic richness criterion, I would expect one of the experiments to consist in deleting attributes. Why did you not do so? Whatever the reason, it would be helpful to better justify your choice.
2) It is surprising to consider that semantics can be evaluated using labels, when one of the motivations for building ontologies and knowledge graphs is to go beyond language and represent “knowledge” using richer representations such as concepts, properties and attributes. Of course, going from labels to BERT vectors is a step towards conceptualization. You need to explain and comment on this.
3) Why did you decide that the external resource should be the same for all KGs? Given that one of your experiments involves KGs extracted from MeSH, it seems "obvious" that Wikipedia is too general and might be missing some MeSH concepts or terms. Again, a discussion of what constitutes a "good external resource" for the purposes of the semantic evaluation that you propose would be very helpful. Have you considered other resources such as KGs (Wikidata, YAGO or DBpedia), domain-specific terminologies, or domain-specific documents?
4) It is not clear why you chose the Pizza ontology. This ontology has been modified several times, giving rise to various versions, and N. Fridman Noy has published several articles on evolutionary issues using this ontology as a test set. It would have been interesting to test your measurements on two or more versions of this ontology. So far, your motivation for testing your approach with this ontology needs to be more detailed.

. Code and data availability: I could not find a link to the code and the datasets used for the experiments. Please add such information.

Given all these unanswered questions, the paper needs to be thoroughly rewritten before it is accepted. I suggest that the authors revise it significantly, perhaps improve their experiments, and submit the new document at a later date.

** Details:

Section 3.1: 2 sentences are repeated with minor modifications: "While existing ... Table 1" is almost the same as "The metrics are repeated ... Table 1".

Section 4: This work expandec ... and introduced -> This work expands ... and introduces ...

In hypothesis 2 (Section 4.5), it is odd, or even meaningless, to talk about leaves in a graph. You should either explain what you mean by a leaf in a graph, or adapt your vocabulary and use another word. In both cases, a definition is required.

In Section 5, you call Ground Truth the values obtained with the KG before it is changed. This is an unusual way to use "ground truth". It deserves a comment and a clear definition in Section 5.1. You use GT in Table 5 without explaining that GT means Ground Truth there.

Reference 26: replace "et al." with the entire list of authors.
References 35 and 36 are the same.
References 83 and 84 are the same.