Review Comment:
This is the second attempt at an extended description of the RODI benchmark. The RODI benchmark aims to provide a benchmark for relational-to-ontology mapping generation scenarios. In more detail, it aims to assess the capacity of different systems to generate mappings that allow producing an RDF dataset, from data originally stored in a relational database, that can answer a given query.
Based on the aforementioned summary of the paper, and thus of the benchmark's function, I hesitate to accept that mapping quality is actually assessed with respect to a posed query workload, since only in this version of the benchmark description is it clarified what notion of mapping quality the paper considers. As far as I can tell from the paper, it is the (extent of the) systems' capacity to generate mappings that is under evaluation, and not the mappings per se. As mentioned in the paper, different mappings may generate the same RDF results. Two such mappings may differ with respect to some mapping quality dimension, but this does not affect the assessed results as long as the systems generate results (RDF triples) that conform to the benchmark's gold standard. So, I am still not convinced how the benchmark enables assessing or improving mapping quality, especially considering that it even allows systems to (blindly) generate their RDF results and feed them into the benchmark for evaluation.
However, what still concerns me most is how balanced the benchmark is. As mentioned in my previous comments, based on the text I cannot determine how many scenarios, test cases and queries exist per mapping challenge, per category and per domain. Reading the evaluation results, I have the impression that the authors do know which sets of test cases examine which aspects, but this is not communicated in the text. There is still (i) no consistency in the descriptions of the input data (databases), schemas, output data (RDF results) and queries, (ii) no consistent description of the (number of) queries/test cases per domain and/or mapping challenge, (iii) no grouping or incremental addition of test-case coverage as domains of increasing size are added (or at least it is not clearly described) and, most importantly, (iv) no clear definition of the measures taken into consideration.
To be more precise, there are three domains: conference, geodata and oil & gas. For the first, no information about the input database is given. How big is it? Is this of interest to know? If yes, why is it not mentioned as it is for the other domains, and what purpose does the large or small size serve? If not, why is it mentioned for the other domains? Three ontologies are used but, based on Table 2 for instance, there is no evidence why the SIGKDD ontology needs to be taken into consideration, given that it covers the same scenarios as CMT and CONFERENCE. In the end it is summarized that 23 classes and 66 properties were taken into consideration, yet the total number of classes and properties used is not described for all domains. Moreover, why is this of interest to know? How does it relate to the input dataset, the queries and the output?
It is also mentioned that “we only generate facts for the subset of classes and properties that have an equivalent in the relational schema in question”. But how often is there such a correspondence? How many queries and test cases are generated? Why are multiple test cases necessary for each challenge or category tag? How many test cases are generated per tag/challenge? In this respect, I find no evidence in the text of how balanced the generated test cases are. Namely, it could be that more test cases are generated for a certain aspect, which makes a tool appear weak on a certain measure when that might not actually be the case.
This brings me to my final and most important remark: there is no clear definition of the benchmark measures. Coming to the evaluation, in Table 5 the first dimension examined relates to adjusted naming; fine, but which category tags are examined to draw conclusions about this? Which mapping challenge does it cover or is it associated with? Then it is mentioned that restructured hierarchies are mainly related to the “n:1 mapping challenge”. First of all, n:1 matching appears to be a category in Table 3 and not a mapping challenge as in Table 1. I appreciate Table 3 and its alignment with Table 2, but why is it limited to the conference domain only? From Table 3, I learn that restructured hierarchies are related to sub-challenges 6, 9, 10 and 12, which in turn are related to denormalization, class hierarchies and key conflicts. So what is the measure here? What are the measures overall? When is one in a position to conclude that a system addresses the normalization challenge, or any of its subtypes? Or are the categories themselves the benchmark measures used to compare the systems?
In the end, the evaluation is carried out using scores for measures that appear neither in the category lists nor in the mapping challenges table. Why should we care about results per domain? For instance, why is it of interest to have scores for cross-matching scenarios per domain? I assume that one mainly cares about which categories and which challenges a tool addresses. So, I fail to clearly see in the text which test cases are added, or which different test cases are covered in each domain/ontology scenario, such that the domain-level evaluation becomes of interest and I can conclude what, e.g., B.OX covers and what IncMap covers. The text gives the impression that certain challenges and categories are introduced (which I take to be the comparison measures), yet other scores, not aligned with the originally introduced challenges and categories, are used to ultimately evaluate the tools.
In a nutshell, I miss a mapping from test cases to the categories and/or challenges they address. In other benchmarks, it is normally clearly defined that, e.g., the challenge is speed and thus the measure is time. We can agree that benchmarking mappings is not as straightforward as benchmarking performance, but I would invite the authors to read other benchmark descriptions, clearly define what they consider their measures, and pose the evaluation against those measures. I am not a benchmark expert myself, but in order to complete this review I did look into other benchmarks, and in most of them the dataset taken into consideration and the exact measures being evaluated are explicitly stated, while the tools under evaluation are assessed with respect to these measures; see, indicatively, [1], or [2], which might be even more comparable (though it also appears to be of smaller scale).
There is no doubt about the contribution of this work, especially considering the extended evaluation and the clarification of the current contributions relative to the previous paper. However, some vague points of crucial importance remain.
Minor comments:
Heyvaert et al. [21] cover the two mapping generation perspectives introduced by some of us [9]; I would suggest mentioning this clearly in the text.
In the same context, “mapping by example” seems to be what [9] consider “result-driven”; please clarify if that is not the case.
“Geodata domain has been designed as a medium-sized case” → What does medium-sized mean here?
In the same context: “For the Mondial scenarios, we use a query workload that mainly approximates real-world explorative queries on the data, although limited to queries of low or medium complexity” and “Those queries are highly complex compared to the ones in other scenarios and require a significant number of schema elements to be correctly mapped at the same time to bear any results” → What are queries of low, medium or high complexity?
“To keep the number of tested scenarios at bay, we do not consider those additional synthetic variants as part of the default benchmark. Instead, we recommend these as optional tests to dig deeper into specific patterns” → Along the same lines as my main concern: there is no concrete total number of tested scenarios (test cases?), and what are the optional tests?
Following that, I would suggest that the authors pay attention to keeping the terminology consistent across the text.
"different modeling variants of the class hierarchy" → Which are?
There is no description of the setup used to run the benchmark (not that it matters much).
“Neither of these papers, however, address the issue of systematically measuring mapping quality.” → Which papers exactly do you mean? I think that [51] and [54] at least propose quality measures, whereas [7] does systematically measure a quality dimension, albeit a different one.
[1] Voigt et al., Yet Another Triple Store Benchmark? Practical Experiences with Real-World Data.
[2] Rivero et al., Benchmarking the Performance of Linked Data Translation Systems.