SpecINT: A framework for data integration over cheminformatics and bioinformatics RDF repositories

Tracking #: 1528-2740

Branko Arsić
Marija Đokić-Petrović
Petar Spalević
Ivan Milentijević
Dejan Rančić
Marko Živanović

Responsible editor: 
Michel Dumontier

Submission type: 
Full Paper
Many research centers and medical institutions have accumulated huge amounts of varied biological and chemical data over the past decade, and this trend continues. Their associated information models, notions, areas of interest, units of measurement, parameters and experimental conditions differ. Based on the Linked Data vision, many semantic applications for distributed access to these heterogeneous RDF (Resource Description Framework) data sources have been developed. Improvements to them have reduced intermediate results and optimized query execution plans. Still, many requests are unsuccessful and time out without producing any answer. Also, queries over different repositories with many data sources are not available. In this paper, SpecINT is proposed as a comprehensive hybrid framework for data integration and federation in semantic query processing over repositories. The innovation of the approach lies in the fact that the coordinates of graph eigenvectors are used for query join-ordering and for translating a directed graph into federated SPARQL queries, instead of data statistics and classical algorithms that are applicable to weighted graphs. Chemists and biologists could benefit greatly from SpecINT by creating a virtually distributed database as a resource for gaining new knowledge about chemical substances and compounds and their natural influence on the environment. In experiments, we demonstrate the potential of our framework on a set of heterogeneous and distributed cheminformatics and bioinformatics data sources.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 19/Feb/2017
Minor Revision
Review Comment:

This manuscript describes the SpecINT framework and its application within the CPCTAS interfaces to check, using information in various public data systems, whether substances have already been analyzed. Users appeared to appreciate the interface, according to the reported survey.

In general, the use of RDF to give real-time federated results, an appropriate SPARQL query generator, and analysis of returned federated results may be of interest to the community to show a real-world example of a semantic application in use.

The first two paragraphs of the introduction contain a number of sentences where the English could be improved to help readability, and there are occasional problem sentences elsewhere. Examples include: "performance of a biomedical research"; "the existing gaps and resource and time saver"; "Making an effort and investing a lot of time."; "The real-time solutions, which don't burden the scientists, hide its complexity from the user and offer tangible results, are desirable."; "This lead to collecting"; "graph construction are repeat for the another repository"; "to use) implie that our". In addition, US and European styles of number representation are used interchangeably, e.g. "4.5" and "4,3".

Review #2
By Alasdair J G Gray submitted on 29/Mar/2017
Review Comment:

This paper is very poorly written, both in terms of English and structure, but I believe that there is some really interesting work backing up the article. However, the two-repository limitation implies that the approach cannot be generalised to an arbitrary set of distributed data sources.

If I have understood the paper correctly, the authors have developed an approach to improve the execution of distributed queries. The minimally connected graph for the BGPs (basic graph patterns) in the query is computed based on a graph theory result involving eigenvectors. However, the paper should be substantially rewritten to explain the approach that the authors have developed. Figure 2 is given to demonstrate the approach, but it is not explained what the nodes and edges represent, or why some edges begin with a bolded line.
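To make concrete the kind of explanation that is missing, the following is a minimal sketch of one generic way eigenvector coordinates can rank the nodes of a query graph. It is not the manuscript's procedure, which is not specified in enough detail to reproduce. The function names, the node labels and the idea of ranking by the principal eigenvector (eigenvector centrality, computed by power iteration) are all assumptions made for illustration only.

```python
def principal_eigenvector(adj, iterations=200):
    """Approximate the principal eigenvector of a symmetric adjacency matrix
    by power iteration. Assumes a connected, undirected graph."""
    n = len(adj)
    v = [1.0 / n] * n
    for _ in range(iterations):
        # Multiply by (A + I): the identity shift keeps the same eigenvectors
        # but prevents oscillation on bipartite graphs such as stars or paths.
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def order_by_eigenvector(nodes, edges):
    """Return nodes sorted by descending principal-eigenvector coordinate.
    Hypothetical helper: an illustrative ranking, not SpecINT's join ordering."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    adj = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        adj[index[a]][index[b]] = 1.0
        adj[index[b]][index[a]] = 1.0
    v = principal_eigenvector(adj)
    return sorted(nodes, key=lambda name: -v[index[name]])


# Hypothetical query graph: one shared variable joined to three others.
nodes = ["compound", "name", "target", "pathway"]
edges = [("compound", "name"), ("compound", "target"), ("compound", "pathway")]
order = order_by_eigenvector(nodes, edges)
```

In this toy graph the most highly connected node is ranked first. A revised manuscript should state with comparable precision which matrix is built, which eigenvector is used and how its coordinates determine the join order.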

The authors claim that their approach would work for arbitrary queries in any domain, but have only demonstrated it within the life sciences (that is not a major problem). However, the queries behind the current implementation have not been presented, nor have sufficient details been given of how they are expanded to span all the available data sources.

The approach seems to depend on mappings to a global schema. It is based on these mappings that the minimum spanning graph is created. If this is the case, the authors should be more explicit about this.

Listing 2 shows a query that is the output of the process. What is unclear is the input to that process, i.e. what is the specific form of the query before it is expanded to cover all the datasets? Do the instance-level mappings take into account the differences in chemical structure between datasets?

The presentation of the experimental evaluations is very poor. They do not provide sufficient details of the experimental setup. That is,

- What is the purpose of the experiment?
- What data sources, and specifically which versions of the data sources, have been used?
- What queries have been used as input?
- What are the dependent and independent variables?
- What is the baseline that you are comparing against? That is, how do you know if you are improving over the current state of the art?

In particular, Table 3 presents numbers from two different translations of DrugBank. It is not surprising that Chem2Bio2RDF does not contain as many targets, as it relates to a very old version of DrugBank (~2009).

Figure 3 presents a line graph for discrete data points, which is not appropriate. Also, the caption should explain the ordering that has been applied. I'm also unclear as to which line represents your system and what substances have been used.

How does the performance of the whole expansion process compare with the time to pose the query? How do the answers generated compare with other integration systems, i.e. are the answers correct?

The SUS evaluation is meaningless for the main contribution of the paper, i.e. it does not evaluate the distributed query processing engine. I would expect to see an evaluation that investigates:

1. The efficiency of the distributed query processing engine: by centralising the datasets and comparing with other systems such as FedX or DARQ for the speed of result response while eliminating network delays. It would be good to also perform the same experiment over the remote endpoints and compare.
2. The correctness of the answers generated.
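As a minimal sketch of the correctness check meant in point 2, the result rows returned by the system under test could be compared against a reference result set for the same query, for example one obtained from a centralised copy of the data. The function name, the row format (tuples of bound values) and the sample identifiers below are assumptions for illustration and are not taken from the manuscript.

```python
def compare_results(reference, candidate):
    """Compare two query result sets, each an iterable of hashable rows.
    Returns precision, recall, and the rows that differ."""
    ref, cand = set(reference), set(candidate)
    hits = ref & cand
    # By convention, an empty candidate or reference set scores 1.0.
    precision = len(hits) / len(cand) if cand else 1.0
    recall = len(hits) / len(ref) if ref else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "missing": sorted(ref - cand),    # expected rows that were not returned
        "spurious": sorted(cand - ref),   # returned rows that were not expected
    }


# Hypothetical (substance, target) rows; the values are placeholders.
reference_rows = [("aspirin", "T1"), ("aspirin", "T2"), ("ibuprofen", "T1")]
candidate_rows = [("aspirin", "T1"), ("aspirin", "T2"), ("aspirin", "T9")]
report = compare_results(reference_rows, candidate_rows)
```

Reporting precision and recall per query, together with the missing and spurious rows, would show directly whether the federated answers are complete and correct.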

There are a variety of typos and missing citations throughout the paper, too numerous to list.

Review #3
Anonymous submitted on 06/Apr/2017
Major Revision
Review Comment:

The paper presents a novel approach to SPARQL searching of multiple federated triple-store repositories. The novelty comes from its ability to seamlessly search repositories that are internally the result of aggregation of multiple sources, and thus have many “sameas” kinds of relationships within them. My understanding of the method is that it builds graphs of these relationships within each repository, then uses these to build appropriate SPARQL queries for each repository. In this way, searching can be effected without a common ontology between resources.

The method is interesting and novel, and in fact applicable outside the fields of cheminformatics and bioinformatics. The biggest problem with the paper is that it really needs editing by a native English speaker. The muddled grammar and use of vague statements make it difficult to follow. I am also dubious of using questionnaires to evaluate the method: surely the method can be quantitatively evaluated with the question: did it retrieve all of the results that match the original (human) query?

As an additional minor point, the paper should reference the OpenPHACTS project (see openphacts.org for references).

If these issues can be addressed, I think this is a useful contribution.