Review Comment:
This is an interesting paper presenting DE4LungCancer, a data ecosystem (DE) that aims to address the data management, clinical, and ethical and legal requirements of the lung cancer pilot of the H2020 EU projects BigMedilytics and CLARIFY, and of the EraMed project P4-LUCAT. DE4LungCancer consists of three DEs that process and analyze the pilot datasets (Clinical DE, Scholarly DE, and Scientific Open Data DE). DE4LungCancer offers a semantic layer composed of a unified schema, biomedical ontologies, and mapping languages; these provide the basis for a transparent process of integrating the data into a knowledge graph (KG).
The paper proposes a Semantic Web "compliant" approach that uses technologies such as RML, FnO, SHACL, UMLS, RDFS, etc. to support data integration, curation/quality management, and knowledge graph creation. My overall assessment is that the paper shows above-average originality and good quality of writing, with moderate significance of results. The most important observation relates to the status and availability of this knowledge graph (KG): although parts of it seem to be accessible (e.g. the SPARQL endpoint for the mapping rules and the online documentation/visualization of the unified schema), it is not clear whether the system is currently operating (at some pilot sites of the linked EU projects?), nor what its status is and how well it has been accepted by the relevant stakeholders. It seems that this KG is integrated within a bigger system offering dashboard and statistical analysis functionalities, but the paper is somewhat unclear on how this integration is achieved, who its users are (oncologists?), and what the relevant scenarios are (the clinical KPIs?).
On more specific items:
* The context of the research could be framed more concretely, in my opinion. For example, in the Introduction we read "electronic health records (EHRs) comprise unstructured clinical notes expressed in Spanish", which is of course not universally true. There are also cases where existing EHRs offer structured access to their content, e.g. using HL7 standards and SNOMED/LOINC vocabularies, so data integration should be simpler and require little or no NLP processing.
* The size of the clinical data used is not clear. In par. 3.1.1 the authors say that data of "more than 1,300 lung patients" have been collected, and just a few lines below we read "Raw data: 988 EHR of patients from 2008-2020". Later, in par. 3.2.1 (Datasets), we read about "1,042 EHRs", and then in par. 5.2 the authors write "We retrieved all the properties of 1,051 patients from the DE4LungCancer KG". Please be specific and clear about the number of cases and about the filtering performed to arrive at the final number of patients.
* On pg 7, lines 18-20, the authors mention "h1v, plasmodium, xenopus, saccpomb, rattus, bos, celegans" and "cooc with labels" without any further link or explanation. What is h1v (HIV, maybe)? Why use the C. elegans PPI network when we are interested in human genomes?
* In par. 3.2.2 (pg 8) a link to the unified schema in the VoCol system is provided (http://ontology.tib.eu/bigmedilytics/). Personally, I wasn't able to check much of the documentation of the unified schema on the online "VoCol" site; I get some errors in the browser console. Also, at lines 31-39 there is not much point in describing what VoCol does (I understand this tool was used for the development of the schema); the citation of the publication is enough, in my opinion.
* In par. 3.2.3 the paper provides some details on the mapping rules. Listings 1 and 2 give SPARQL queries for retrieving those mappings, but I expect knowledgeable readers would prefer to see some examples of the mappings themselves instead.
* The use of the mappings is not clear to me. In paragraph 3.2.1 we read that the clinical data are structured in JSON, while the Scholarly data (i.e. publications) are in a Neo4J graph. What pipelines are used to integrate all these data and unify them under the common schema using the RML/R2RML mapping rules?
* The authors perform some (initial?) statistical analysis to identify patterns in the hospital services used by patients before their diagnosis. On pg 18 and 19 there is a comparison using Spearman's rho and the Jaccard index for the periods 0-1, 0-4, 4-12, 4-13, 4-14, and 4-15. Have the authors taken into account that these periods actually overlap (i.e. 4-15 includes 4-12, 4-13, ...) and are therefore not independent?
* Pg 10, line 8, "9,4 mapping rules (standard deviation 16,4)": numbers should use the GB/US English decimal delimiter (i.e. "9.4 mapping rules (standard deviation 16.4)").
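
To illustrate the concern about overlapping periods raised above, here is a minimal sketch with entirely made-up data. The `visits` mapping, the service names, and the reading of the period labels as inclusive time ranges are my own assumptions for illustration and are not taken from the paper. It shows only that when one period contains another, the set of services observed in the shorter period is necessarily a subset of the set observed in the longer one, so a Jaccard index computed between them is constrained by construction.

```python
def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical data: time point -> hospital services used at that time.
visits = {
    0: {"emergency"},
    2: {"radiology"},
    5: {"pneumology", "radiology"},
    9: {"emergency"},
    13: {"cardiology"},
    15: {"oncology"},
}


def services_in(start, end):
    """Union of the services used at time points in [start, end]."""
    used = set()
    for time_point, services in visits.items():
        if start <= time_point <= end:
            used |= services
    return used


# The nested periods named in the comment above (assumed inclusive).
periods = {"4-12": (4, 12), "4-13": (4, 13), "4-14": (4, 14), "4-15": (4, 15)}
service_sets = {name: services_in(*bounds) for name, bounds in periods.items()}

# Nesting of the periods forces nesting of the observed service sets.
nested = service_sets["4-12"] <= service_sets["4-15"]

# For nested sets the Jaccard index reduces to |smaller| / |larger|.
j = jaccard(service_sets["4-12"], service_sets["4-15"])
ratio = len(service_sets["4-12"]) / len(service_sets["4-15"])
print(nested, j, ratio)
```

Whatever quantities the authors actually compare, a similar dependence would be expected among the 4-12, 4-13, 4-14, and 4-15 periods, and it would help if the paper stated how this is accounted for.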