Publishing DisGeNET as Nanopublications

Tracking #: 1050-2261

Núria Queralt-Rosinach
Tobias Kuhn
Christine Chichester
Michel Dumontier
Ferran Sanz
Laura I. Furlong

Responsible editor: Boyan Brodaric

Submission type: Dataset Description
The increasing and unprecedented publication rate in the biomedical field is a major bottleneck for discovery in the Life Sciences. Although the scientific community is limited by an inability to manually curate facts from published papers, recent approaches enable the automatic, scalable and reliable extraction of assertions from the scientific literature. While the publication of assertions on the Semantic Web is gaining traction, it also creates new challenges in ensuring proper provenance, such as versioning for dataset change-sensitive link generation. Here, we address these issues and describe our efforts to represent the DisGeNET database of human gene-disease associations as permanent, immutable, and provenance-rich digital objects called nanopublications. This is the first Linked Dataset that ensures stable interlinking to the assertion and its metadata by trusty URIs. As DisGeNET integrates expert-curated and text-mined data of different origins, the semantic description of the evidence for each assertion is provided to confer trust and allow evidence-based hypothesis generation. We describe our steps to ensure high quality and demonstrate the utility of linking our dataset to others on the emerging Semantic Web.

Solicited Reviews:
Review #1
By Amrapali Zaveri submitted on 23/Apr/2015
Review Comment:

The authors have addressed my previous comments, and the paper can now be accepted. However, some sentences need to be rephrased and tightened. I list a few here, but I strongly recommend that the paper be proofread by a native speaker:
- "scientific community is limited an inability" - please rephrase
- "dataset change-sensitive link generation." - please rephrase
- "As DisGeNET integrate expert-curated and text-mined data of different origin, the semantic description of the evidence for each assertion is provided to confer trust and allow evidence-based hypothesis generation." - "integrates", "origins"
- "and healthcare" - "and healthcare research" (?)
- "public different databases" - "different public databases"
- "In addition, linkouts to the LOD are set in order to both..." - please rephrase the whole sentence
- "The interlinking is derived from the cross-references provided by the source databases." - this still sounds unclear to me. What tool did you use to produce these interlinks?
- "how has to be formally represented" - "how it has to be formally represented"
- "the source database from which was derived" - "the source database from which it was derived"
- "using as a base the DisGeNET namespace ( )" - "using the DisGeNET namespace ( ) as a base"
- "in TriG syntax" - "using the TriG syntax"
- "DisGeNET nanopublications can be accessed in three ways: " - but only two ways are mentioned.
Availability, Production and Sustainability
- "serialize in the recommended TriG syntax our nanopublications" - "serialize our nanopublications in the recommended TriG syntax "
- "RGD, CTD mouse and rat datasets, and our literature-mined BeFree dataset, and new annotations regarding the level of evidence of each data source as it is highlighted in Table 1." - "and" appears twice in the sentence; also, use "as highlighted in Table 1"
- "The versioning track for nanopublications consists of keeping track of the version’s provenance of both for the RDF and so for the relational version of DisGeNET, from which the RDF is derived." - please rephrase "of both for the" and "and so for"
Related Work
- "In this work, it is underlined the importance of including appropriate provenance and context information to avoid confusion to data consumers" - "In this work, the importance of including appropriate provenance and context information is underlined to avoid confusion to data consumers"