Review Comment:
This paper describes VIG, a data scaler for Ontology-Based Data Access (OBDA) that takes into account the domain information of ontologies and mappings. It is supposed to be very efficient at generating huge amounts of data in constant time, since it does not have to retrieve previously generated tuples. Additionally, according to the authors, the scaling process can be delegated to different machines, scaling up to the number of columns in the schema without communication overhead.
Generally, the paper is well written, and scaling datasets is an important problem for Big Data. Some observations follow:
a) This work is an extension of the original work presented in the BLINK 2016 Workshop for Benchmarking. Although I generally agree with the authors about the new content added in this revised work, I believe that some of the currently reported future work should be included in this paper. For example, for completeness the authors should consider supporting distributions other than the uniform one (e.g. normal, power-law, etc.) when generating column values, since this could help support more real-life datasets (as shown in the DBLP one). Regarding the multi-attribute foreign keys, I think that the discussion provided at the end of the article suffices to understand the complexity of the problem.
b) I would like to see a complexity discussion for the different stages of the VIG algorithm, especially for the last steps of the analysis phase, i.e. the columns cluster analysis and the satisfaction of foreign keys, a problem that is encoded as a constraint satisfaction problem (CSP).
c) Similarly, I am not convinced about the linear complexity of the VIG algorithm. I would like to see more fine-grained experiments on generation times (which currently are only provided for the BSBM experiment), with more and bigger scale factors (e.g. 1, 10, 100, 1,000, 10,000, 100,000). I guess aiming for a dataset of 10,000,000 products in 5.2.1, or for the memory limits of the bigger HP server, would provide useful insights (from Fig. 4 it seems that the space complexity of the VIG algorithm is rather low).
d) I would also like to see experiments supporting the authors' argument that parallelization can scale up to the number of columns.
e) I think the paper would greatly benefit from a table illustrating the different stages of the analysis phase of the algorithm for the running example. The current text is difficult to follow, due to the dense notation and the fact that the reader has to constantly check things in previous pages.
f) In 5.2.2 the authors make a comment about dependencies between binary tuples stored in different tables. This is a limitation of the VIG approach which I would like to see somewhere in the presentation of the algorithm (maybe with other limitations), and not in the evaluation section.
In the same section, in the discussion of the results, the authors claim that VIG performs substantially better than RAND, which runs the tests twice as fast as VIG or NTV. I would like to see some discussion of this argument, though (why is slower query performance better?).
Furthermore, I am not sure about the importance of reporting results for two different databases (PostgreSQL/MySQL). A discussion motivating this choice seems to be missing.
Although I am not an expert in benchmarking, I think that this work can be a nice addition to the relevant bibliography and that my comments can help the authors improve its quality.