Knowledge Graphs for Enhancing Transparency in Health Data Ecosystems

Tracking #: 3125-4339

Maria-Esther Vidal
Ahmad Sakor
Samaneh Jozashoori
Emetis Niazmand
Disha Purohit
Enrique Iglesias
Fotis Aisopos
Dimitrios Vogiatzis
Ernestina Menasalvas
Alejandro Rodriguez Gonzalez
Guillermo Vigueras
Daniel Gomez-Bravo
Maria Torrente
Roberto Lopez
Mariano Provencio Pulla
Athanasios Dalianis
Ana Triantafillou
Georgios Paliouras

Responsible editor: 
Guest Editors SW Meets Health Data Management 2022

Submission type: 
Full Paper
Tailoring personalized treatments demands the analysis of a patient's characteristics, which may be scattered over a wide variety of sources. These features include family history, life habits, comorbidities, and potential treatment side effects. Moreover, the analysis of the services visited the most by a patient before a new diagnosis, and of the type of requested tests, may uncover patterns that contribute to earlier disease detection and treatment effectiveness. Building on the concept of knowledge-driven ecosystems, we devise DE4LungCancer, a data ecosystem of health data sources for lung cancer. Knowledge extracted from heterogeneous sources, e.g., clinical records, scientific publications, and pharmacologic data, is integrated into knowledge graphs. Ontologies describe the meaning of the combined data, and mapping rules enable the declarative definition of the transformation and integration processes. Moreover, DE4LungCancer is assessed in terms of the methods followed for data quality assessment and curation. Lastly, the role of controlled vocabularies and ontologies in health data management is discussed, together with their impact on transparent knowledge extraction and analytics. This paper presents the lessons learned in the development of DE4LungCancer and demonstrates the level of transparency supported by the proposed knowledge-driven ecosystem in the context of the lung cancer pilots in the EU H2020-funded project BigMedilytics, the ERA PerMed-funded project P4-LUCAT, and the EU H2020 project CLARIFY.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 07/Jul/2022
Major Revision
Review Comment:

The paper presents a knowledge-driven data ecosystem of health data for lung cancer, named DE4LungCancer, which enables the integration of heterogeneous biomedical data sources (e.g., clinical records, scientific publications, and pharmacologic data) and provides a knowledge graph that enables various data analytical methods. The paper presents some of these methods, and reports on the outcomes that have motivated the execution of clinical interventions to enhance treatment effectiveness and lung cancer patients’ quality of life.
The ecosystem presented is comprehensive and valuable. As a general comment, the current version of the paper, which covers all the work done to develop the knowledge-driven ecosystem, risks overwhelming the reader with content too rich to be fully appreciated. It might be beneficial to focus the paper on some aspects of the work and provide the reader with all the relevant pieces of information about them (e.g., focus on a detailed description of the implementation of the ecosystem and all its components, while setting aside the example application). As an example, when considering the discovery of drug-drug interactions, the authors do not mention any quantitative measures to evaluate the results of the method developed, nor do they refer to any other work they have done on the matter. To increase the soundness and credibility of their statements, the authors should include more details on this task.

Other comments pertain to the clarity and structure of the various sections of the paper. In more detail:
- The difference and relationship between Section 3.1 and Section 3.2 should be better articulated.
- A clearer discussion of how the requirements introduced in Section 2.1 have been tackled would be beneficial.
- The data quality section should be better introduced.
- In the clinical data ecosystem, the authors state that the data are anonymized (p. 5, line 39). Is this complete anonymization or pseudonymization? In case of complete anonymization, how are the data of the same person grouped?
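To make the anonymization question above concrete, the following is a minimal sketch of keyed-hash pseudonymization, under the assumption that each record carries a direct patient identifier. It is not the authors' method: the field names, the sample records, and the key are hypothetical. The point it illustrates is that a stable pseudonym still allows records of the same person to be grouped, which complete anonymization would not.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical secret; in practice it would be held only by the data controller.
SECRET_KEY = b"hypothetical-secret-held-by-the-data-controller"


def pseudonymize(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Map a patient identifier to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def group_by_pseudonym(records):
    """Group records by pseudonym and drop the direct identifier."""
    grouped = defaultdict(list)
    for record in records:
        pseudonym = pseudonymize(record["patient_id"])
        cleaned = {k: v for k, v in record.items() if k != "patient_id"}
        grouped[pseudonym].append(cleaned)
    return dict(grouped)


# Synthetic records, for illustration only.
records = [
    {"patient_id": "P-001", "service": "Pneumology"},
    {"patient_id": "P-002", "service": "Oncology"},
    {"patient_id": "P-001", "service": "Radiology"},
]
grouped = group_by_pseudonym(records)
```

Here the two records of the hypothetical patient "P-001" end up under the same pseudonym, while the identifier itself no longer appears in the grouped data.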

Other (minor) comments:
- The UMLS acronym is introduced twice
- The expansion of the CUI acronym seems to be missing (p. 14, line 47)
- The definition of p in e=(q,p,k) is missing (p. 16, line 33)

Review #2
By Stelios Sfakiannakis submitted on 15/Aug/2022
Major Revision
Review Comment:

This is an interesting paper presenting DE4LungCancer as a data ecosystem (DE) aiming to address the data management, clinical, and ethical and legal requirements of the lung cancer pilots of the H2020 EU projects BigMedilytics and CLARIFY, and of the ERA PerMed project P4-LUCAT. DE4LungCancer consists of three DEs that process and analyze the pilot datasets (Clinical DE, Scholarly DE, and Scientific Open Data DE). DE4LungCancer offers a semantic layer composed of a unified schema, biomedical ontologies, and mapping languages; they provide the basis for a transparent data integration process into a knowledge graph (KG).

The paper proposes a Semantic Web "compliant" approach using technologies like RML, FnO, SHACL, UMLS, RDFS, etc. to support data integration, curation/quality management, and knowledge graph creation. My overall comment is that the paper has above-average originality and good quality of writing, with moderate significance of results. The most important observation relates to the status and the availability of this knowledge graph (KG): although parts of it seem to be accessible (e.g., the SPARQL endpoint for the mapping rules and the online documentation/visualization of the unified schema), it is not clear whether the system is currently operating (at some pilot sites from the linked EU projects?) and what its status and acceptance by the relevant stakeholders are. It seems that this KG is integrated within a bigger system offering dashboard and statistical analysis functionalities, but the paper is a bit unclear on how this integration is achieved, who its users are (oncologists?), and what the relevant scenarios are (the clinical KPIs?).

On more specific items:

* The context of the research can be framed more concretely in my opinion. For example, in the Introduction we read "electronic health records (EHRs) comprise unstructured clinical notes expressed in Spanish", which is of course not universally true. Also, there are cases where existing EHRs offer structured access to their content, e.g., using HL7 standards and SNOMED/LOINC vocabularies, so data integration should be simpler, with little or no NLP processing.

* The size of the clinical data used is not clear. In par. 3.1.1 the authors say that data of "more than 1,300 lung patients" have been collected, and just a few lines below we read "Raw data: 988 EHR of patients from 2008-2020". Later, in par. 3.2.1 (Datasets), we read about "1,042 EHRs", and then in par. 5.2 the authors write "We retrieved all the properties of 1,051 patients from the DE4LungCancer KG". So please be specific and clear about the number of cases and the filtering performed to arrive at the final number of patients.

* On pg 7, lines 18-20, the authors mention "h1v, plasmodium, xenopus, saccpomb, rattus, bos, celegans", "cooc with labels" without any further link. What is h1v? (HIV, maybe?) Why use the C. elegans PPI network, since we are interested in human genomes?

* In par. 3.2.2 (pg 8) a link to the unified schema in the VoCol system is provided. Personally, I wasn't able to check much of the documentation of the unified schema on the online "VoCol" site; I get some errors in the browser console. Also, at lines 31-39 there's not much point in describing what VoCol does (I understand this tool was used for the development of the schema); the citation to the publication is enough in my opinion.

* In par. 3.2.3 the paper provides some details on the mapping rules. Listings 1 and 2 give SPARQL queries for retrieving those mappings, and I expect knowledgeable readers would prefer to have some examples of the mappings themselves instead.

* The use of the mappings is not clear to me. In par. 3.2.1 we read that clinical data are structured in JSON, while Scholarly data (i.e., publications) are in a Neo4J graph. What pipelines are used so that all these data are integrated and unified under the common schema using the RML/R2RML mapping rules?

* The authors perform some (initial?) statistical analysis for identifying patterns in the hospital services used by patients before their diagnosis. On pg 18 and 19 there's a comparison using Spearman's rho and the Jaccard index for the periods 0-1, 0-4, 4-12, 4-13, 4-14, and 4-15. Have the authors taken into account that these periods actually overlap (i.e., 4-15 includes 4-12, 4-13, ...) and therefore are not independent?

* Pg 10, line 8, "9,4 mapping rules (standard deviation 16,4)": numbers should use the GB/US English decimal separator (i.e., "9.4 mapping rules (standard deviation 16.4)").
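
The concern about overlapping periods raised above can be illustrated with a small sketch. It uses synthetic events and assumes that a period "a-b" means months a to b before diagnosis, inclusive; both the data and this reading of the periods are assumptions, not taken from the paper. Because the period 4-15 contains every event of 4-12, any similarity measure computed between the two is inflated by construction.

```python
def events_in_window(events, start, end):
    """Return events whose month offset lies in [start, end] (inclusive, an assumption)."""
    return [e for e in events if start <= e[0] <= end]


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def shared_fraction(inner, outer):
    """Fraction of the outer window's events that also belong to the inner window."""
    outer_set = set(outer)
    return len(set(inner) & outer_set) / len(outer_set) if outer_set else 0.0


# Synthetic (months before diagnosis, service) events for one hypothetical patient.
events = [
    (5, "Pneumology"), (7, "Radiology"), (11, "Emergency"),
    (13, "Cardiology"), (15, "Radiology"),
]

w_4_12 = events_in_window(events, 4, 12)
w_4_15 = events_in_window(events, 4, 15)

services_4_12 = {service for _, service in w_4_12}
services_4_15 = {service for _, service in w_4_15}

# Nested windows share all events of the smaller one by construction.
service_jaccard = jaccard(services_4_12, services_4_15)  # 3 shared of 4 distinct services
event_overlap = shared_fraction(w_4_12, w_4_15)          # 3 of 5 events are shared
```

In this toy case the service sets of the two periods have a Jaccard index of 0.75 even though nothing about the patient's behaviour links them beyond the shared events. Comparing disjoint periods (e.g., 4-12 against 13-15) would avoid this dependence.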