Harmonizing Heterogeneous Data: A Materialized Knowledge Graph Approach for Federated Information Systems

Tracking #: 3635-4849

Authors: 
Abid Ali Fareedi
Ahmad Ghazawneh

Responsible editor: 
Oshani Seneviratne

Submission type: 
Full Paper
Abstract: 
Healthcare is one of the major industries where sharing information with a common understanding is essential. This research presents a federated knowledge graph framework with its materialized knowledge graph (MKG) approach. It also explores the transformative potential of federated knowledge graphs (FKGs) as a solution to address the persistent challenges of data integration and interoperability in today's complex information landscape, especially within federated information system development (ISD). The authors tailored Design Science Research Methodology (DSRM) to develop design artifacts embedded with a core domain-oriented ontological metadata model focused on cardiovascular diseases. By adopting FKGs, organizations can streamline data integration efforts, improve cross-system data harmonization, and foster seamless data exchange with metadata standardization among diverse information sources. This research highlights the advantages of FKGs in enhancing data connectivity with semantic alignment, seamless data transformation and interlinking, facilitating on-demand data retrieval to support optimization of query performance, scalability, and more effective decision-making processes within healthcare settings.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Sola S. Shirai submitted on 12/Apr/2024
Suggestion:
Reject
Review Comment:

This paper explores the topic of interoperability and integration of heterogeneous data from various sources, particularly in the domain of healthcare, and targets the following research question: "How does the federated knowledge graph framework improve data integration and interoperability issues using a materialized knowledge graph approach in federated information systems (IS)?"
The main claimed contribution of the paper appears to be a framework of using "materialized federated knowledge graphs" as an approach to perform integration of healthcare data.

While I believe that the paper does a good job of presenting the overarching challenges surrounding data integration and making use of existing tools for various knowledge graph / ontology related tasks, I also think that it has major issues which make it unsuitable for publication.

The organization and writing of the paper make it very unclear what exactly the claimed contributions of the paper are. It is not clear what the authors intend to mean by "materialized knowledge graphs" (MKG), "federated knowledge graphs" (FKG), or "materialized federated knowledge graphs" (MFKG) - MKG is not an established concept as far as I'm aware, and if it's intended to be a new concept introduced by the authors, it is not properly defined or differentiated from other approaches. My own understanding of "materialization" would lead me to think that an MKG would just be making use of query or reasoning results - this seems more like just applying a new label to an existing and well-established approach to utilizing KGs. The authors' use and interpretation of "federation" also does not match my understanding of the concept, and throughout the presented work it appeared more like various sources of data / ontologies were being imported and aligned rather than performing any federation.
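To make the usual reading of "materialization" concrete, here is a minimal sketch in plain Python, assuming a toy triple representation. The class names and the `materialize_subclass_closure` helper are hypothetical and are not taken from the paper: inferred facts are computed once and stored alongside the asserted ones, so later queries read them directly.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def materialize_subclass_closure(triples: Set[Triple]) -> Set[Triple]:
    """Return the input triples plus every inferred transitive subClassOf triple."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshot the current subclass edges so the set can grow safely.
        edges = [(s, o) for s, p, o in closed if p == "subClassOf"]
        for child, parent in edges:
            for other_child, grandparent in edges:
                inferred = (child, "subClassOf", grandparent)
                if parent == other_child and inferred not in closed:
                    closed.add(inferred)  # store the reasoning result
                    changed = True
    return closed


# Hypothetical asserted triples, used only for illustration.
ASSERTED: Set[Triple] = {
    ("Arrhythmia", "subClassOf", "CardiacDisease"),
    ("CardiacDisease", "subClassOf", "CardiovascularDisease"),
}
MATERIALIZED = materialize_subclass_closure(ASSERTED)
```

Under this reading, materialization is a storage decision about inference results rather than a distinct kind of knowledge graph.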

Furthermore, it is not clear what the novelty of the authors' approach is. As far as I can tell, this paper seems to just describe a process of looking into various data sources, manually performing mapping/alignment of concepts among them, then integrating the results together. I do not see any evidence that the authors' approach to performing this process is novel, that it is differentiated from prior approaches, or that it solves problems that prior approaches cannot. The paper's associated GitHub resource is also just images (which are included in the paper) and the ontology / alignment RDF used in their example - this further indicates to me that the authors' approach is primarily just performing manual mappings between various heterogeneous data sources, rather than introducing some more generalizable or automated approach.

While this paper looks like it could be an interesting example of how to use semantic web tools to align and integrate data on a toy problem surrounding cardiovascular diseases, I do not believe that it presents original or significant research findings.

Review #2
By Sabbir Rashid submitted on 11/Aug/2024
Suggestion:
Reject
Review Comment:

1. Introduction
Statistics regarding the growth of digital data are unnecessary since they are questionable, and the storage of data isn't crucial to the argument being made.
RQ -
a) I assume this stands for the research question, but you need to expand on this acronym here.
b) In fact, the use of an acronym is unnecessary here.
c) The research question can be rephrased for clarity - “How does the federated knowledge graph framework improve data integration and interoperability issues using a materialized knowledge graph approach in federated information systems (IS)?” -> “How does using a materialized knowledge graph approach improve data integration and interoperability issues in federated information systems?”
Virtual Knowledge Graph - When you introduce the term, expand on what is meant by a virtual knowledge graph. How is it different from a regular knowledge graph?
Materialized Knowledge Graph - When you introduce the term, expand on what is meant by materialized knowledge graph and materialized federated knowledge graph.

2. Theoretical Background
2.2 1st Sentence - the items in the enumeration don’t need to be capitalized. This likely also applies to other in-paragraph enumerations in the paper. - “...into three main classes: 1) Data-level heterogeneity, 2) Ontological heterogeneity, and 3) Temporal heterogeneity…” -> “...into three main classes: 1) data-level heterogeneity, 2) ontological heterogeneity, and 3) temporal heterogeneity…”
2.1 Paragraph 1, 3rd to last sentence - “data set” -> “dataset”
2.2 Paragraph 2 - unnecessary capitalization: “Semantic interoperability” -> “semantic interoperability”
2.2 Paragraph 2, second to last sentence - unnecessary capitalization in enumeration: “...into three further main categories: 1) Classification, vocabulary, and terminology standards; 2) Data interchange standards; and 3) Health record content standards.” -> “...into three further main categories: 1) classification, vocabulary, and terminology standards; 2) data interchange standards; and 3) health record content standards.”
2.3 Paragraph 1 - unnecessary enumeration: “...present at two distinct levels: 1) The system level and 2) The semantics level.” -> “...present at two distinct levels, the system level and the semantics level.”
2.3 Paragraph 1 - incorrect usage of a proper noun: “Unknown-formats” -> “unknown formats”
2.3 Paragraph 1, last sentence - missing Oxford comma: “...by utilizing proprietary bridges, adapters, wrappers or common intermediaries…” -> “...by utilizing proprietary bridges, adapters, wrappers, or common intermediaries…”
Note: I’m not going to list out every instance of missing Oxford commas. Sometimes you use them, sometimes you don’t. Either always use them or never use them, but don’t switch between usage preferences.
3. Methodology
Paragraph 1, Sentence 4 - run-on sentence, consider shortening or splitting into multiple sentences: “This strategy helps to understand contextual knowledge within the healthcare domain (e.g., cardiovascular ontological metadata model) and other data models and their integration and knowledge management to facilitate end-users with shared understanding and unification of the data to resolve the data interoperability and data exchange issues.”
Note regarding footnotes: I’m not going to list out every time this occurs as it happens maybe 10 times throughout the paper, but footnotes go after, not before, the punctuation. Please update.
Note regarding figures: Texts in many of the figures tend to be very small, making them difficult to read, especially in a printed version of the document. Update Figures 1, 2, 5, 6, 8, 10, 11, 14, and 16.
3.3 - You list out written numbers for each bullet, so you might as well enumerate here.
4. Experimental Results
One of the main contributions of this work is the proposed Cardiovascular Disease Ontology (CVO). There is already an ontology by this name (CVDO) which was first published in 2016/2017. This existing ontology is not cited at all. This is the first major flaw of this paper. If you propose a new cardiovascular disease ontology, you must highlight the difference and why the new ontology is necessary.
Many of the classes in CVO are plural (e.g. Cardiovascular_Diseases, Cardiac_Disease_Causes, Recommendations, Risk_Factor_Indicators, etc.). While this is not forbidden, it is generally thought of as a bad practice.
4.1.2 - “Virtual Semantic View Framework” is the title of the subsection so the expansion of the acronym “VSVF” can be inferred, but technically it is not introduced before the usage of the acronym.
4.1.3 - typo in the first sentence but also unnecessary punctuation: “Table 3., delineates…” -> “Table 3 delineates…”
This happens more than once but I’ll just mention it here. An article is often missing when using the word “ontology.” For example, in the first sentence of this section: “...the process of converting structured relational data sources, such as relational databases, into ontology…” -> “the process of converting structured relational data sources, such as relational databases, into an ontology…” Please fix missing article usage throughout the paper.
4.4 - typo: “Pallet” -> “Pellet”
4.4 - typo in the last sentence, a period is inserted twice
5. Evaluation and Testing
Combining multiple knowledge graphs is not federation. Combining multiple ontologies is not federation. Federation involves querying from multiple sources of data in an integrated way without necessarily mutating or combining the data sources. This is another major flaw of this paper. All the queries presented are regular queries. Where are the federated queries? Despite the title of the article, you are not actually doing federation.
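The distinction can be sketched in a few lines of Python, assuming two toy in-memory triple sets as stand-ins for independent endpoints. The source names, predicates, and helper functions are hypothetical and are not taken from the paper. In SPARQL 1.1, the query-time variant would typically be expressed with the `SERVICE` keyword; the sketch below only mimics that behaviour.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

# Two independent, hypothetical sources; neither is copied into the other.
HOSPITAL_A: Set[Triple] = {
    ("patient:1", "hasDiagnosis", "cvd:Hypertension"),
    ("patient:2", "hasDiagnosis", "cvd:Arrhythmia"),
}
HOSPITAL_B: Set[Triple] = {
    ("cvd:Hypertension", "hasRiskFactor", "risk:HighSodiumDiet"),
    ("cvd:Arrhythmia", "hasRiskFactor", "risk:Caffeine"),
}


def match(source: Iterable[Triple], predicate: str) -> List[Tuple[str, str]]:
    """Return sorted (subject, object) pairs for one predicate in one source."""
    return sorted((s, o) for s, p, o in source if p == predicate)


def federated_risk_factors(
    source_a: Iterable[Triple], source_b: Iterable[Triple]
) -> List[Tuple[str, str]]:
    """Federation: query each source separately and join the answers at query time."""
    diagnoses = match(source_a, "hasDiagnosis")
    risks = match(source_b, "hasRiskFactor")
    return [
        (patient, factor)
        for patient, disease in diagnoses
        for risk_disease, factor in risks
        if risk_disease == disease
    ]


def merged_risk_factors(
    source_a: Iterable[Triple], source_b: Iterable[Triple]
) -> List[Tuple[str, str]]:
    """Integration: copy both sources into one graph, then run a regular query."""
    merged = set(source_a) | set(source_b)
    return federated_risk_factors(merged, merged)


FEDERATED = federated_risk_factors(HOSPITAL_A, HOSPITAL_B)
MERGED = merged_risk_factors(HOSPITAL_A, HOSPITAL_B)
```

Both functions return the same answers here. The difference lies in where the data lives when the query runs, which is the point at issue in this comment.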

Acronym usage -
a) You don't need to and arguably shouldn't include acronyms in the abstract.
b) Don't introduce an acronym more than once; just include it the first time the phrase is introduced and subsequently use the acronym.
c) If you use a phrase only once, most likely you don't need to introduce the acronym for it.
d) Don't use acronyms that you haven't introduced.

Third-person voice - Don't refer to yourselves as "the authors" and instead use a first-person pronoun such as "we." I won’t list all the times this happens, but it should be updated throughout the paper.

Decision: Reject
The two major flaws of this paper are the following:
The ontology proposed has already been published by another group. That existing ontology is neither cited nor compared against. Furthermore, upon inspection, the existing ontology uses better ontology design patterns than the one presented here.
The title of this paper is misleading since the approach does not implement any federation.

Review #3
Anonymous submitted on 20/Aug/2024
Suggestion:
Major Revision
Review Comment:

The paper presents a paradigm for harmonizing heterogeneous data by creating knowledge graphs from federated information systems. The authors claim this approach. The biggest downside of the work is the lack of novelty (and this is significantly due to the fact the paper is not focused on a well-defined research gap – more details below). The employment of knowledge graphs to tackle data integration, normalization, and harmonization is very well documented in the literature, especially for the health domain, where the authors test their approach. However, challenges still exist. This paper would greatly benefit from having those outstanding challenges scoped with focused citations.

Although the authors tried to focus on health data, the problem space of the paper is extensive, and the proposed solution needs to tackle specific identified challenges. The proposed solution is presented at a very high level, including elements related to data modeling, ontology matching/aligning, data governance, and more. Due to this, the paper is not easy to read, as many concepts are cited without an apparent relevance to the problem space.

This work would be better suited to a venue more concerned with data governance. Even so, the paper would need to be significantly revised to be considered.

Notably, the article needs to acknowledge more approaches that tackle the parts that compose their proposed solution. For example, RML, a W3C candidate recommendation, could solve parts of the problem. Several existing approaches for ontology matching and alignment are also able to accomplish some of the claims in the paper.

I suggest that the authors restructure the paper to focus on specific problems. For that, the following needs to be accomplished:
- Section 1 needs to be shortened and the problem clearly introduced. The authors digress, for example, on details about human data production, which is not necessary. This section contains many references that could be in a Related Work section which needs to be added to the article. Most of all, the problem needs to be clearly introduced. The authors present one research question that involves “data integration and interoperability.” These are very broad areas of research.
- Section 2 (Theoretical Background) should be renamed to Related Work or similar and focus on the identified research problems and cite relevant works toward solving the problem. It only lists and describes several works not directly related to the authors' proposal.
- Section 5 (Evaluation and Testing) requires significant improvement. The authors claim they can fulfill competency questions, but these need to be clearly defined or listed somewhere in the paper. This will help to demonstrate the effectiveness of their approach.
- Section 6 (Discussion) should provide a comparison of the results of the approach with the state-of-the-art. It should highlight the novelties the approach brings, with clear metrics, and also discuss its shortcomings. Where does this approach fall short compared to existing ones?