Converting neXtProt into Linked Data and nanopublications

Tracking #: 461-1638

Christine Chichester
Oliver Karch
Pascale Gaudet
Lydie Lane
Barend Mons
Amos Bairoch

Responsible editor: 
Oscar Corcho

Submission type: 
Dataset Description
The development of Linked Data provides the opportunity for databases to supply extensive volumes of biological data, information, and knowledge in a machine interpretable format to make previously isolated data silos interoperable. To increase ease of use, often databases incorporate annotations from several different resources. Linked Data can overcome many formatting and identifier issues that prevent data interoperability, but the extensive cross incorporation of annotations between databases makes the tracking of provenance in open, decentralized systems especially important. With the diversity of published data, provenance information becomes critical to providing reliable and trustworthy services to scientists. The nanopublication system addresses many of these challenges. We have developed the neXtProt Linked Data by serializing in RDF/XML annotations specific to neXtProt and started employing the nanopublication model to give appropriate attribution to all data. Specifically, a use case demonstrates the handling of post-translational modification (PTM) data modeled as nanopublications to illustrate the how the different levels of provenance and data quality thresholds can be captured in this model.

Major Revision

Solicited Reviews:
Review #1
By Amrapali Zaveri submitted on 23/May/2013
Major Revision
Review Comment:

The paper "Converting neXtProt into Linked Data and nanopublications" focuses on the serialization of annotations specific to neXtProt, a protein knowledge platform, and on incorporating the nanopublication approach to provide provenance information. A use case demonstrating the handling of post-translational modification data modeled as nanopublications is explained to illustrate how the different levels of provenance and data quality thresholds can be captured in this model.

The converted dataset uses a large number of established vocabularies to model the data, which is in line with the principle of reusing existing vocabularies. The data is also made available for download as an RDF dump. Converting this data using the nanopublication model seems a reasonable and useful effort considering the large amount of useful information that is available.

However, key information is lacking in the paper and in the dataset itself:
- making the data available via a SPARQL endpoint
- more use cases
- interlinks to other external datasets (even to the RDF version of UniProtKB), and use of these links to obtain further information
- a VoID description of the dataset including the versioning and licensing information
- an update mechanism as well as policies to ensure sustainability and stability

About the conversion process, why was it necessary to transform the XML data to a relational data model? Wasn't a conversion from XML to RDF (via XSLT) possible? What were some of the errors that were encountered during the conversion? Additionally, the one use case explained in this paper is not very clear. Providing a concrete example of an 'assertion' might help readers better understand the usage. An actual analysis of an author's (or group of authors') scientific contribution could explain and illustrate the use case better (if possible). Also, why do the authors look into minting of URLs when talking about Linked Data, where URIs are used? What are the known shortcomings of the dataset?

The paper is easy to read, but it contains a few errors and needs clarification in certain places:
- Abstract: "…to illustrate the how the different…" - "…to illustrate how the different..."
- Introduction: "accessibly" - "accessible"
- Add references for the Open PHACTS project, UniProtKB Linked Data model, BioPAX ontology
- Expand the abbreviation PTM in the abstract and add the abbreviation at the first occurrence of the word in the text
- What does "standards based ontologies" mean?
- In Figure 1, what does RDM stand for? Relational data model? Please add the abbreviation in the text.
- Section 3: add reference [9] in the first paragraph itself instead of the third paragraph
- Section 4: "possibly" - "possibility"
- The last figure is incorrectly numbered figure 2
- Section 5: The first paragraph of the Conclusion is better suited to the introduction. The conclusion section should discuss the main contributions, limitations (if any) and future work
- Section 5: "Nanopublications encoded in RDF can be more easily mined, queried and retrieved through the Internet…" I would rather say a SPARQL endpoint instead of just stating the "Internet" in general!
- Reference 4 is not used anywhere in the paper
- Check formatting of reference 21

Review #2
By Prateek Jain submitted on 17/Jun/2013
Major Revision
Review Comment:

The work 'Converting neXtProt into Linked Data and nanopublications' presents the methodology for the conversion of neXtProt data into Linked Data. The paper presents the methodology, the different schemas utilized, a sample snippet, and the publications that can benefit from the dataset.

The work can potentially prove useful for the biomedical community. However, I have some comments regarding the presentation and the material covered, because of which I am recommending a major revision.

1. Use cases besides nanopublications are not entirely clear to me. It is not entirely clear how the dataset can help the overall community, or what contributions it makes that other datasets do not.

2. How is the work different from other datasets in biomedical domain which are part of LOD? Some information regarding this will be useful and help the dataset stand out from others.

3. How is the gold, silver and bronze measure calculated? Details on this would be interesting and useful.

4. Linked Data by itself does not promote openness. Disclaimer: I was involved in this discussion, but please see

5. What is the future direction for the dataset? How do the authors plan to keep it updated, and is maintaining it a single-person effort? Can the authors please include a subsection in the paper explaining the license used for the dataset and the motivation behind choosing that specific license?