SpecINT: A framework for data integration over cheminformatics and bioinformatics RDF repositories

Tracking #: 1745-2957

Authors: 
Branko Arsić
Marija Đokić-Petrović
Petar Spalević
Ivan Milentijević
Dejan Rančić
Marko Živanović

Responsible editor: 
Michel Dumontier

Submission type: 
Full Paper
Abstract: 
Many research centers and medical institutions have been accumulating a vast amount of various biological and chemical data over the past decade and this trend continues. Based on Linked Data vision, many semantic applications for distributed access to these heterogeneous RDF (Resource Description Framework) data sources have been developed. Their improvements have brought about a decrease of intermediate results and optimizing query execution plans. But still many requests are unsuccessful and they time out without producing any answer. Also, the applications which operate over repositories taking into consideration their specificities and inter-connections are not available. In this paper, the SpecINT is proposed as a comprehensive hybrid framework for data integration and federation in semantic data query processing over repositories. The SpecINT framework represents a trade-off solution between automatic and user-guided approaches, since it can create queries which return relevant results, while not being dependent on human work. The innovativeness of the approach lays in the fact that the coordinates of graph eigenvectors are used for the automatic sub-queries joining over the most relevant data sources within repositories. In this way searching can be effected without a common ontology between resources. In experiments, we demonstrate the potential of our framework on a set of heterogeneous and distributed cheminformatics and bioinformatics data sources.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Alasdair J G Gray submitted on 19/Dec/2017
Suggestion:
Major Revision
Review Comment:

The writing in the paper has been vastly improved from the previous version. However, the structuring of the presentation of the work and the depth of the presentation are still problematic.

The paper presents an approach to generate SPARQL queries over a collection of datasets where there are query fragments for each dataset and sameAs links extracted from UniChem (not shown in architecture figure). The construction of the executed query is based on a selection of sources based on an eigenvector computation (still not fully explained, a more detailed explanation of the working example is needed), where the order of selection will affect the final result of the query. While this is an interesting line of enquiry, the theoretical underpinnings are limited to two datasets. This implies that the approach cannot be generalised to arbitrary set of distributed data sources.

The evaluation now attempts to explore the results of the generated queries, but still insufficient detail is given to understand and interpret the data. Very dated sources are used for the evaluation, e.g. Chem2Bio2RDF has not been updated since 2009 and uses ChEMBL version 8. How accurate and complete are the results? Can you compare your responses to those provided by other platforms, e.g. Open PHACTS or the EBI-RDF platform?

The authors claim that the computations are all completed on-the-fly over external SPARQL endpoints. Why then can't you use up-to-date data, e.g. the EBI-RDF platform contains the latest version of ChEMBL.

Introduction

The introduction is overly long and contains material that would be better placed in the related work section – in particular the discussion of related drug discovery platforms. The beginning of the introduction is repetitive and devoid of citations. The introduction should give a succinct overview of the work and its significance and highlight the contributions.

Related Work

The related work is generally dated and thus does not reflect the current state of the art. In particular, there has been a lot of recent work on query benchmarking, e.g. Linked Data Benchmark Council and HOBBIT. There is also the Performant SPARQL queries paper of Loizou, Angles, and Groth (2015 – 10.1016/j.websem.2014.11.003). The discussion of automated query generation at the top of column 2 on page 2 is missing any references. Where this is revisited on page 4, the citations are againa dated. There is more recent work, see ISWC 2017 for details.

There is no discussion of how the work compares with platforms such as the EBI-RDF platform or Open PHACTS. The discussion of Open PHACTS is incorrect. Open PHACTS does not pre-integrate the sources. It takes a local copy for performance reasons, but all data remains in its original form. Queries then extract relevant parts of each dataset based on contextualised instance equivalences retrieved from the Identity Mapping Service.

Section 3

This is poorly named, it is more a discussion of data sources and their equivalences.

Due to the writing it is unclear whether you are basing your equivalences on InChI relationships or synonyms. You should be aware of claims of equivalences, particularly if you are basing this on synonyms.

In listing 1 why are you using PIBAS sameAs rather than owl:sameAs?

Reference 36 is for an old conversion of ChEMBL to RDF. The EBI have built on this work and now publish ChEMBL as RDF data.

The second paragraph of Section 4 would fit better at the end of Section 3.

SpecINT Architecture

I would suggest making a clearer distinction of the call to UniChem for equivalence mappings from the other data sources in Figure 1.

Sub-query patterns

You should explicitly state what is needed for a pattern and how they are currently constructed.

How is the overlap of data content in the sources dealt with, particularly if they contain different (contradictory) values? Open pHACTS overcomes this problem by choosing which source to use for each of the values it returns. What is the equivalent design decision in SpecINT?

Data Source Selection

Provide links to full URIs returned by UniChem lookup. This will allow the reader to see which data sources are interlinked and how.

Figure 2, which is used to demonstrate and explain the approach, remains problematic as it is not explained. What do the red dots actually represent? Why do the labels of these dots have numbers? Why is the same substance occuring mutliple times? What was the seed URI used to generate the figure?

Graph Construction

Use your working example to explain your approach.

At the end of the section (page 8) you state "we developed a simple ontology which consists of information about data sources." Why didn't you use an existing standard ontology, VoID or DCAT? How does your ontology compare?

Listing 2

What is the mapping_node?

In the drugbank and chembl sub-queries there are instance URIs that should be bolded.

SpecINT Use Cases

This section does not contribute to this paper for the semantic web community. It could be summarised in a short paragraph as motivation in the introduction thus providing space for the fuller explanations of the approach.

Evaluation

The presentation of the experimental evaluations is improved but still lacks cruicial information. They do not provide sufficient details fo the experiment setup. That is,

- What is the purpose of the experiment?
- What data sources, and specifically version of data sources have been used?
- What queries have been used as input?
- What are the dependent and independent variables?
- What is the baseline that you are comparing against? That is, how do you know if you are improving over the current state of the art?

I would suggest that you publish full details and results at a repository such as figshare.

I would expect to see an evaluation that investigates

1. The efficiency of the distributed query processing engine: by centralising the datasets and comparing with other systems such as FedX or DARQ for the speed of result response while eliminating network delays. it would be good to also perform the same experiment over the remote endpoints and compare.
2. The correctness of the answers generated, in particular recall and precision values

What is the ground truth referred to in Table 3? How was this derived?

In Section 6.2 you claim that Chem2Bio2RDF is widely used. I would like to see evidence for this. Chem2Bio2RDF has not been updated since 2009 and as such it utility is very minimal these days. Also the statistics given for the number of triples and datasets does not match those given on the Chem2Bio2RDF website. Please give URLs for each of the datasets you are using.

Can you provide some interpretation of the results returned in Table 4 in comparison to other frameworks? How do the answers compare? which is more useful?

Your interpretation of Q7 in your SUS (needs citation) is incorrect. Question 7 is written in the negative form, so the response implies that your users found the system readily available.

typo on p14: form -> from

Review #2
Anonymous submitted on 04/Jan/2018
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This is a huge improvement over the original, particularly in terms of clarity and quality of writing. Language is mildly awkward in parts, but I think this is now fine to go ahead and publish. Responses to my prior comments are good. I think the work is quite novel and interesting.