VIG: Data Scaling for OBDA Benchmarks

Tracking #: 1602-2814

Davide Lanti
Guohui Xiao
Diego Calvanese

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
In this paper we describe VIG, a data scaler for Ontology-Based Data Access (OBDA) benchmarks. Data scaling is a relatively recent approach, proposed in the database community, that allows for quickly scaling an input data instance to s times its size, while preserving certain application-specific characteristics. The advantages of the scaling approach are that the same generator is general, in the sense that it can be re-used on different database schemas, and that users are not required to manually input the data characteristics. In the VIG system, we lift the scaling approach from the pure database level to the OBDA level, where the domain information of ontologies and mappings has to be taken into account as well. VIG is efficient; notably, each tuple is generated in constant time. To evaluate VIG, we have carried out an extensive set of experiments with three datasets (BSBM, DBLP, and NPD), using two OBDA systems (Ontop and D2RQ), backed by two relational database engines (MySQL and PostgreSQL), and compared with real-world data, ad-hoc data generators, and random data generators. The encouraging results show that the data scaling performed by VIG is efficient and that the scaled data are suitable for benchmarking OBDA systems.

Major Revision

Solicited Reviews:
Review #1
By Manolis Terrovitis submitted on 19/May/2017
Major Revision
Review Comment:

This paper extends previous work [15] by providing additional experiments that highlight the applicability of the proposed VIG system. VIG is a data scaler for OBDA benchmarks and is tested on three datasets (BSBM, DBLP, NPD), using two OBDA systems (Ontop, D2RQ), backed by two relational database engines (MySQL, PostgreSQL). There is no additional technical material compared to [15].

Strong Points:
• A new dataset (DBLP), a new OBDA system (D2RQ), and a new relational database engine (PostgreSQL) are taken into consideration in the experimental part, in addition to those in [15].
• VIG outperforms a random generator approach on DBLP. The fact that this good-quality scaling is achieved on DBLP, which is a real database instance and not a synthetic one like BSBM, makes the VIG system more valuable and trustworthy for data scaling purposes.

Weak Points:
• The paper does not provide any new technical content compared to [15].
• The 16 queries used for the VIG evaluation on DBLP dataset are manually crafted, and there is no intuition provided behind their creation. A workload created to resemble some real world usage or created according to some realistic assumptions would better highlight the capabilities of VIG.

Review #2
Anonymous submitted on 27/Oct/2017
Minor Revision
Review Comment:

Ontology-based data access (OBDA) systems create a layer of
abstraction over the storage solution to present to the application
programmer a virtual RDF graph. When the storage back-end is a
relational database, the OBDA system automatically creates and
populates a complex relational schema. As some of the statistical
properties of OBDA instantiations are due to the fact that they store
an RDF graph, they are present regardless of the application that
populates the database.

This observation is exploited by the authors to develop a data scaler
that is aware of the idiosyncrasies of OBDA instantiations. The
description of the algorithm that scales data while maintaining join
selectivities is clear, and the explanation is accompanied by a running
example that helps the reader understand the algorithm.
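The idea of maintaining join selectivities under scaling can be illustrated with a toy sketch (my own simplification, not the paper's actual algorithm): if each column's distinct values are modeled as an integer interval, with shared values forming an overlapping sub-interval, then multiplying all interval bounds by the scale factor leaves the overlap ratio, and hence the join selectivity between the two columns, unchanged. All function names below are illustrative.

```python
def scale_intervals(a, b, s):
    """Toy interval model: each column's distinct values are an integer
    interval [lo, hi); an overlapping sub-interval encodes joinable
    values. Scaling both bounds by s keeps the overlap ratio unchanged."""
    return (a[0] * s, a[1] * s), (b[0] * s, b[1] * s)

def overlap_ratio(a, b):
    """Fraction of column A's distinct values that also occur in B."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / (a[1] - a[0])

a, b = (0, 100), (60, 220)        # 40% of A's values also occur in B
sa, sb = scale_intervals(a, b, 5)
assert overlap_ratio(a, b) == overlap_ratio(sa, sb) == 0.4
```

Because the overlap grows proportionally with the column sizes, a join between the scaled columns keeps the same selectivity while the data itself grows by the scale factor.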

Regarding the evaluation, what I missed is a comparison with an RDF
data scaler. The authors compare against directly populating the
database without being aware of the RDF nature of the data, but a
fairer comparison would be against a database populated with scaled-up
RDF data via OBDA. I expect that in such an experiment the authors
would find no difference in the quality of the scaled-up datasets,
but VIG would be a lot faster.

Specific editorial comments:

The paragraph in the introduction with the implementation details
(license, Github pointer) breaks the flow of the text. This is very
important information, but I would place it either in the conclusions
(in the sense of "if you found what you have read interesting, here is
an implementation") or in the experimental setup (Section 5, where the
information is actually repeated). Either way, mention implementation
details only in one place, and not in the introduction.

The same goes for the unnecessary details about specific systems used
in the experimental setup (PostgreSQL, ontop, etc.) These details are
important, but not appropriate for the introduction. They should be
given in the experimental setup section and, in fact, provide more
details (specific versions used in the experiments).

Section 4 is the meat of the submission, as it presents the algorithm
that prepares the data generation phase and guarantees that consistent
data can be generated in constant time (i.e., without comparing against
previously generated data). This is an important contribution and, it
seems, the bulk of the methodological work. However, the opening of
Section 4 fails to explain this to the reader and instead starts off
with details about the random number generator. I suggest that the
random generator text be placed in a subsection with preliminaries,
and that the opening paragraph be expanded to explain what the section
is about.
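As a rough illustration of how generation can avoid comparisons against previously generated data (a sketch under my own simplifying assumptions, not VIG's actual implementation): a collision-free affine permutation of the key space lets the i-th key be computed independently in constant time, with uniqueness guaranteed by construction rather than by bookkeeping. The constants and names here are illustrative.

```python
from math import gcd

def make_permutation(n: int, a: int = 2654435761, b: int = 12345):
    """Return a bijection on range(n): i -> (a*i + b) % n.
    The map is collision-free exactly when gcd(a, n) == 1."""
    while gcd(a, n) != 1:  # adjust multiplier until coprime with n
        a += 1
    return lambda i: (a * i + b) % n

# Generate the keys of a scaled table of size s * n in O(1) per tuple:
n, s = 1000, 3
perm = make_permutation(s * n)
keys = [perm(i) for i in range(s * n)]
assert len(set(keys)) == s * n  # no duplicates, no bookkeeping needed
```

Since each key depends only on its index, tuples can be generated independently (and even in parallel) without consulting any previously generated data.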

Review #3
Anonymous submitted on 06/Nov/2017
Major Revision
Review Comment:


This submission presents a data scaling approach for benchmarking Ontology-Based Data Access (OBDA). It builds on OWL and SPARQL as well as R2RML. The main technical part of the paper is the VIG algorithm, which is divided into two phases: collecting information about the initial data, and then growing the initial data based on the previous analysis and the stated scale factor. The approach is evaluated with several case studies and technologies.


The paper is well-structured and easy to read. The investigated problem is relevant and extends existing approaches for scaling pure database systems working solely with relations. The experimental results seem promising and also point out future research lines for improving the presented approach.

While in general I appreciate the approach and presentation, there are also points which should be improved. In the following, I detail these points.

1) The paper is an extension of a previous publication at the BLINK workshop. This is mentioned in the introduction, where the improved evaluation is stated as the difference in this journal submission. However, it would be really interesting for the reader to learn why the evaluation has been extended. What can we learn from the evaluation in this submission with respect to the previous workshop evaluation? I assume some aspects are evaluated more deeply, or more results have been needed to derive some specific conclusions. This should also be discussed briefly in the introduction. In particular, the question comes up of what the DBLP case contributes to the evaluation. Is it about the real-world data aspect?

2) Concerning the evaluation, I am impressed by the different settings which have been used to evaluate the presented approach. However, as the evaluation section is long and many aspects are discussed, I would propose to introduce research questions for the evaluation section, to be answered by analyzing the different cases. In Section 5.4 these questions should be explicitly answered and threats to validity should be discussed. Perhaps also related to my previous point about the additional evaluation cases mentioned in the introduction, it would be interesting to discuss the characteristics of the cases.

Also, the setup of the evaluation study and the discussion of the results could be more clearly separated. For instance, the discussion of the adaptation of the BSBM case is very detailed, and one somehow gets lost in these details when reading the evaluation section, mostly due to jumping between the setup of the study and the results.

3) Related work: The distinction from Rex [4] is not clear. It is only mentioned that Rex has a better handling of content for non-key columns. As Rex is mentioned as a strictly related approach, a more detailed discussion would be needed.

4) When reading this paper, I was wondering why the scaling is done on the database level and not on the ontology level. Would it be easier? Currently the mappings have to be exploited to reason about proper ways to scale the data. I somehow have the impression that using the ontology directly would allow exploiting more semantics to reason about proper ways of scaling data. At least, a discussion of why it was decided to do the data scaling on the database level would be interesting for the reader. I understand that the data originates from the database, but why not scale the corresponding ontology graph?

Minor issues

Abstract and Introduction: "read-world data" -> "real-world data"

Page 10: data that are -> data that is

Page 12: FactPages ^12 -> please delete the space between the word and the footnote

Page 14: cannot are not taken -> delete cannot