Bringing Relational Databases into the Semantic Web: A Survey

Authors: 
Dimitrios-Emmanuel Spanos, Periklis Stavrou and Nikolas Mitrou
Abstract: 
Relational databases are considered one of the most popular storage solutions for all kinds of data and they have been recognized as a key factor in generating huge amounts of data for Semantic Web applications. Ontologies, on the other hand, are one of the key concepts and main vehicle of knowledge in the Semantic Web research area. The problem of bridging the gap between relational databases and ontologies has attracted the interest of the Semantic Web community, even from the early years of its existence and is commonly referred to as the database-to-ontology mapping problem. However, this term has been used interchangeably for referring to two distinct problems: namely, the creation of an ontology from an existing database instance and the discovery of mappings between an existing database instance and an existing ontology. In this paper, we clearly define these two problems and present the motivation and benefits for both of them. We attempt to gather the most notable approaches proposed so far in the literature, present them concisely in tabular format and group them under a classification scheme. We finally explore the perspectives and future research steps for a seamless and meaningful integration of databases into the Semantic Web.
Submission type: 
Survey Article
Decision/Status: 
Accept

Comments

First of all, we would like to thank all reviewers for their valuable comments, which helped us greatly in improving our paper. Our responses to the reviewers’ comments follow:


REVIEWER 1
This paper reviews methods and tools that bring relational databases into the Semantic Web. It first describes existing approaches that generate ontologies from relational databases, and then examines the problem of discovering mappings between relational databases and ontologies. As a review paper, it further summarizes existing approaches according to some representative features for easy comparison at a higher level. Finally, it also discusses future explorations in this area. This paper thus provides enlightening knowledge for users who are not familiar with this area, and offers useful foundations for future research.
________________________________________
REVIEWER 2
In this article, the authors review methods and tools from the recent literature that bring relational databases into the Semantic Web. In particular, three similar but distinct subproblems (creating an ontology or domain-specific ontology from a given relational database, and discovering mappings between an existing ontology and a given database) are distinguished and discussed in detail. As a great number of methods and tools have been proposed and developed to bring relational databases into the Semantic Web in recent years, a thorough and complete review of such methods and tools is certainly necessary.
This paper covers most of the recent methods and tools, categorizes all the related methods into a hierarchical structure, shown in Fig. 1, and briefly introduces the mechanisms of the most representative ones. In particular, the authors compare the inputs and features of those methods and clearly list the comparison results in tables. In addition, the semantic awareness of the methods is compared.
Nevertheless, as a survey paper, I am afraid that this paper would be rather difficult for junior researchers to follow. The paper is not well structured, and some terms are not formally or informally defined in this paper, so that readers need to refer to other literature for detailed explanation. Further, neither a clear conclusion nor a future direction is reached in this paper.

Indeed, certain parts of the paper used terms that assumed prior knowledge of the subject by the reader. We took the liberty to rewrite and rephrase a large part of the paper and included additional figures, bearing in mind this observation. For the same reason, we introduced Section 3: “Preliminaries” for the definition of terms and technologies used throughout the paper. We have also introduced a separate Section 8: “Future Directions” where we illustrate some future perspectives on the problems investigated.

First of all, the authors should emphasize the importance of the summarized sketch of approaches, i.e., Figure 1, Table 2 and Table 3.
Specifically,
1. As a survey paper, readers are most interested in the comparison of features and performances among methods and tools. To clearly address the differences among them, authors may give a brief introduction to the evolution of the methods. Thus, Figure 1 may be put in an earlier section instead of the section specific to the "creation of ontology", and a description of the motivation behind each evolution may also be attached to each node in the figure.

We introduced a new Section 4: “A Classification of Approaches”, where we present the taxonomy of approaches, the classification criteria considered for distinguishing classes of the taxonomy, and also the descriptive parameters that were used in the overview table containing all approaches (previously Table 3, now Table 8). Figure 1 has also been enriched: next to each class of approaches, it now shows the main motivations and benefits achieved by the respective class.

2. After approaches are briefly compared in Figure 1 centered section, readers are aware of the measures that will be used in Figure 3, e.g., existence of ontology, automation, etc., and thus, Table 2 and 3 might be given at the beginning of Section 4 and 5, and the introduction of each individual approach is centered around the tables, so that the readers can easily compare the approach with other related ones.

As we mentioned above, in the newly inserted Section 4, we introduce these descriptive measures and depict them in Figure 2. We decided to base the introduction of each category of approaches (namely, in Sections 5.1, 5.2.1, 5.2.2 and 6) on these measures, since a measure often has the same value for an entire class of methods and therefore, interesting conclusions can be drawn. In contrast, we avoided basing the presentation of every single approach on these measures, as we felt this would make the text rather repetitive. Furthermore, since Table 8 (previously, Table 3) summarizes all the approaches mentioned throughout the paper, we thought it would be better suited for Section 7: “Discussion”, where some of the main points of the paper are summarized and challenges and performance evaluation are discussed. As far as Table 6 (previously, Table 2) is concerned, given that it lists all input information sources for a certain group of approaches (namely, those of Section 5.2.2), it is introduced and referred to in this particular subsection. We made this choice because this factor is only relevant for tools presented in subsection 5.2.2 and is not a global descriptive parameter that characterizes all approaches. We should also note at this point that, in order to help the reader compare homogeneous approaches, we have also added Tables 4, 5 and 7 in Sections 5.1, 5.2.1 and 5.2.2 respectively. These tables examine in finer detail some aspects that pertain to a particular class, in contrast with the more general view that is provided by Table 8.

3. As stated in Section 2, the reader might be driven by some specific motivation (semantic annotation of dynamic web pages, ontology-based data access, etc.) to refer to this paper. Thus, he/she expects to find the right method and tool quickly and directly. Therefore, Tables 2 and 3, as the essentials of the paper, should include the possible benefits of each method or tool.

We have added a column in the overview Table 8 (previously, Table 3) where we mention the main motivation or benefit that each approach claims to have or achieve. Due to space limitations, we could not include in this table every application context in which each approach could potentially be used, and we explicitly state this on page 12. Thus, the reader can also consult Figure 1 and, depending on the class to which the method under consideration belongs, view the associated benefits.

Also, terms should be clearly defined, e.g.,
4. What is "meta modeling" in Section 4.1?

This term is now explained in the second column of page 14.

5. What is "domain-specific ontology"? Does it mean "to extract a subset" or "manually define his own mappings"? Although readers might infer its meaning from the context, the authors should give clear definition. In Section 4.2.1, we see all the approaches belonging to this category work as a "generation tool for general ontology" except that the tools extract a subset.

The terms “domain ontology” and “domain-specific ontology” are used interchangeably in the text. They have been properly introduced in the “Preliminaries” section (Section 3, page 7). The domain of the generated ontology is also used as a classification criterion distinguishing between approaches of Sections 5.1 and 5.2 (previously, Sections 4.1 and 4.2 respectively) and it is therefore also analyzed in the last paragraph of page 9 in Section 4. As explained in the text, all tools of Section 5.2 generate a domain-specific ontology, in contrast with tools of Section 5.1, which generate a database schema ontology. This domain-specific ontology can be either generated manually according to user-defined mappings or extracted semi-automatically by analysis of the database schema.

6. What is "database logical schemas"? Is there any difference between the "logical schema" and the "schema"?

We distinguish between the terms “conceptual schema” and “logical schema” in Section 3, page 5. Since the only logical model considered in this paper is the relational one, we often use the terms “relational schema” or “database schema” or even plain “schema” in the place of “logical schema”. We also tried to make the distinction clear from the context of each term. Nonetheless, whenever we refer to a conceptual schema, we mention it explicitly.

7. What is "materialized" in Section 5?

This term has been properly introduced in the presentation of the data accessibility parameter in Section 4, page 11.

8. Different approaches favor different descriptive languages, e.g., ETL/SPARQL, R2O Language/RDF/XML, OWL/OWL DL/OWL-Lite, etc, and thus, authors should give a complete comparison among those different types of languages, and had better summarize them in a set of tables, ahead of the summarization for the approaches, for a more convenient further reference.

Most of these languages and technologies (e.g. the RDF/XML serialization format, the three OWL 1 species, SPARQL) have been briefly introduced in Section 3, where definitions have been deliberately kept as short as possible. As far as differences between the features of the three OWL species are concerned, we think that a summarizing table would introduce unnecessary complexity and lengthen Section 3 even further. Instead, we discuss some key differences between OWL Full and OWL DL that are of interest in the introduction of Section 5.1, page 14. Regarding the other languages mentioned in Table 8, these are mainly custom mapping definition languages that are not used outside the relational database to ontology mapping problem. An in-depth comparison of these relational-to-RDF mapping languages is an issue that would require a full paper of its own, such as reference [64]. However, a brief comparison is given in Table 5.

Issues regarding the comparison results and conclusion:
9. Authors should employ a specific dataset or case study to compare those methods (including the mapping results, the number of correct predictions, the efficiency, etc.), as a supplement to the textually descriptive comparison, which is more valuable for the readers to choose the right method or tool. Although it might be difficult to evaluate approaches with different requirements of inputs and different types of outputs, readers still expect a common platform to compare the methods.

Throughout Section 7, we discuss the difficulty of providing an impartial evaluation methodology for the comparative analysis of all approaches presented in this paper. This is due not only to the variety of inputs and outputs considered by every method, but most importantly to the high-level goal and motivation of each approach. As we examine each class of approaches on its own, we observe that different measures are more appropriate for each class. Therefore, we believe that a common evaluation procedure for all tools is impossible. Furthermore, for certain classes, the efficiency of each approach cannot be strictly quantified as it may be more qualitative in nature (for example, the amount of domain knowledge that is extracted from a database schema in the case of ontology learning approaches). Even for approaches that can be evaluated in terms of a quantifiable measure, the development of an objective benchmark is a difficult issue that needs special attention and space of its own (e.g. references [26], [57] and [107]). Nevertheless, we totally agree with the reviewer’s comment on the necessity of providing a common reference point for the comparison of approaches and we believe that this is partly achieved in Tables 4-8.

10. Figure 2 may be regarded as a visual comparison result of all the methods and tools, and thus, it's better to plot the features (e.g., automation) against the performances (e.g., semantic awareness) of each approach listed in Tables 2 and 3. Moreover, are there any features or performance metrics other than automation or semantic awareness that could also be illustrated as in Figure 2? If so, plot them as well.
11. How is Figure 2 discovered? Is there any parallel experiment? Why is the distance between the empty triangle and the filled circle much closer than the distance between the filled circle and the filled rectangle? Authors should use quantitative experiment to locate the position of each point on y-axis.

Indeed, Figure 2 was mainly qualitative and not based on any practical experiment, given that “semantic awareness” is also a qualitative measure. This figure was originally included in order to visually show the observed trade-off between the level of automation of an approach and the amount of knowledge it correctly elicits (the measure we refer to as semantic awareness). This observation is now described in the second paragraph of Section 7, page 30 and Figure 2 has been removed, since it is not based on formal analysis. Careful examination of all approaches and inspection of the summary of their features in Table 8 revealed no additional features showing this amount of correlation with each other.

12. Authors should also give a future perspective for the methods and tools of "ontology to database mappings".

Section 8 “Future Directions” has been added, where three issues that must be taken into account by future solutions were identified.

Other issues concerning structure or phrasing problems include
13. In Section 1, the paper lists three scenarios where the term "database to ontology" might be used. The second one "the export of a database's contents in RDF" is not further mentioned anywhere in the paper. Readers might ask what is the difference between the first and the second scenario.

This part has been rephrased. We now list in Section 1 some of the functionalities offered by the reviewed approaches, without correlating them with the classes of the taxonomy in Figure 1.

14. In the section paragraph of Section 4.1, advantages and disadvantages should be addressed separately. The sentences from the beginning of the paragraph to "without any user intervention" are devoted to the advantages, and without any conjunction, the succeeding sentence supports the disadvantage of the method. However, the last sentence gives a positive attitude toward the method. Authors should reorganize the paragraph.

This paragraph has been partly rewritten. The negative point originally mentioned is a result of the adoption of OWL Full, which also applies to some other methods in this class. Therefore, we elaborate on this point in greater detail in the introduction of Section 5.1.

15. There is an extra dot between "Discovery of class hierarchy" and "Typically, hierarchies..." in Page 9.

In fact, there were two missing dots in the two following paragraphs. This has been fixed for uniformity purposes (now, on pages 25-26).

16. In Page 9, "since these features cannot elicited" → "since these features cannot elicit".

The grammatical mistake has been corrected.

________________________________________
REVIEWER 4
This paper presents a good review of various methods and tools for mapping databases to ontologies. First, the motivation and benefits of mapping databases to ontologies are introduced; then, formal definitions of the relational model and ontology are outlined, based on which a naive mapping solution is defined; various tools and methods are divided into two categories: creating ontologies from databases and mapping databases to existing ontologies.
Although the authors make some comparison and analysis of the reviewed tools, there is no analysis based on real datasets. It would be better to have some experimental analysis, and the efficiency of different tools should also be discussed.

The efficiency of tools and the measures that can be used for their proper evaluation are discussed in the newly added Section 7, while comments on efficiency can be found in the description of some tools (e.g. in the first paragraph of page 29 for MARSON). Section 7 also discusses the difficulty of carrying out a non-superficial, objective evaluation procedure for all of the approaches mentioned in this paper and gives some pointers to the few relevant works that give some insight into the efficiency of some of the tools reviewed in this paper. In our opinion, a thorough experimental analysis would involve 4 or more testbeds, one for every discrete motivation, as different measures - both qualitative and quantifiable - are of interest in approaches driven by different motivations. For example, whereas query response time is a typical measure used for the evaluation of approaches that allow for query-based data access, it is not appropriate for measuring the efficiency of ontology learning methods or approaches that discover mappings between a relational database and an ontology. This is the reason why we think that such an analysis would have to be presented in a separate paper. Nevertheless, we have introduced additional Tables 4, 5 and 7 for a comparison of tools that is as objective as possible.


_____________________________________
REVIEWER 6
This is a comprehensive survey on the problem of mapping relational databases to ontologies. The authors laboriously review an extensive number of tools which have claimed to provide (partial) solutions to the problem. There are about 40 different tools mentioned in this survey. Many of them I am not aware of. Quite impressive! The paper classifies the problem into two different flavors: the extraction of ontologies from databases and the discovery of semantic mappings between existing ontologies and databases. Although the survey discusses several application scenarios that motivate the research on these problems, it is not very clear what the technical challenges in developing a solution are. It would make the survey more interesting if more technical challenges and solutions are presented.

Technical challenges for every group of approaches are summarized in the newly added Section 7, together with some discussion on the objective performance evaluation of tools and benchmarks.

I have a couple of suggestions for improving the quality of the paper. First, I think collapsing several sections would make the paper more concise and cohesive. For example, the title of section 3 is A Naïve Solution. It is not clear which concrete problem the solution is presented for when the authors just classified the general problem into two flavors. I think the entire section 3 and section 4 could be collapsed and simplified with a focus on extracting ontologies from databases. The current presentation contains too many trivial details that don't contribute significantly to the problem.

Indeed, the first half of Section 3 contained some preliminary definitions, while the second half included a basic approach for translating a relational database to an RDF graph. The first half has now been extended to cover more terms and technologies that were used in the paper without being properly defined, while the second half has been slightly rewritten and moved to the introduction of Section 5, a more relevant placement, since most approaches of Section 5 build on and refer to this basic approach.

Second, I think it would make the paper more comprehensive if the tools can be further classified in terms of problem challenges, technical solutions, and final results including mapping languages.

We could not think of a way to adapt our taxonomy further to take these aspects into consideration, as we felt this would complicate the analysis. Nevertheless, we try to clarify in Section 7 the challenges faced by every class of approaches as well as the general direction of the solutions proposed. The mapping languages used are discussed in the introduction of every relevant section covering a group of tools (namely, Sections 5.1, 5.2.1, 5.2.2 and 6) and the list of mapping languages for all tools is given in Table 8.

Moreover, if the authors can collect the data sets used by the experiments of the tools and publish them in a central place, that would greatly benefit the community. Overall, it is a timely survey on the problem of mapping databases to the semantic web.

In Section 7, we identified the need for a thorough and objective benchmark for all classes of approaches (perhaps with the exception of tools exposing SPARQL endpoints, which are compared in reference [26]). We have also observed that even evaluation of individual approaches is scarce. Although there are not many, we gathered some of the datasets used for evaluation purposes by the approaches reviewed, hoping that they could be reused by future approaches. The link is shown in footnote 20 on page 35.