DataGraft: One-Stop-Shop for Open Data Management

Tracking #: 1428-2640

Authors: 
Dumitru Roman
Nikolay Nikolov
Antoine Pultier
Dina Sukhobok
Brian Elvesæter
Arne Berre
Xianglin Ye
Marin Dimitrov
Alex Simov
Momchill Zarev
Rick Moynihan
Bill Roberts
Ivan Berlocher
Seon-Ho Kim
Tony Lee
Amanda Smith
Tom Heath

Responsible editor: 
Rinke Hoekstra

Submission type: 
Tool/System Report
Abstract: 
This paper introduces DataGraft (https://datagraft.net/) – a cloud-based platform for data transformation and publishing. DataGraft was developed to provide better, easier-to-use tools for data workers and developers (e.g., open data publishers, linked data developers, data scientists) who consider existing approaches to data transformation, hosting, and access too costly and technically complex. DataGraft offers an integrated, flexible, and reliable cloud-based solution for hosted open data management. Key features include flexible management of data transformations (e.g., interactive creation, execution, sharing, and reuse) and reliable data hosting services. This paper provides an overview of DataGraft, focusing on its rationale, key features and components, and evaluation.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
Anonymous submitted on 05/Aug/2016
Suggestion:
Accept
Review Comment:

The paper presents a tool for managing open datasets in a cloud-based platform, dubbed DataGraft.
Using the tool, it is possible to define, execute and monitor pipelines for publishing raw data as open data or linked data. The main components (backend and frontend) were introduced.
The paper was submitted as 'Tools and Systems Report'. It is reviewed considering two dimensions: (1) Quality, importance, and impact of the described tool or system; and (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool. In this case, the points listed below should be improved for accepting the paper.

I consider that all my comments have been satisfactorily corrected after the first review (http://www.semantic-web-journal.net/content/datagraft-one-stop-shop-open... - Review #2). Therefore, I vote for the acceptance of the article in this journal.

Review #2
By Christophe Guéret submitted on 18/Aug/2016
Suggestion:
Accept
Review Comment:

The paper is a revised submission describing a cloud-based solution for managing data in a DBaaS way. Compared to the original submission, the authors did significant work in addressing the comments they received. The present version of the submission is clearer and more convincing. On a personal note, I appreciate that the authors went down to actually working on the spreadsheet I gave as an example to prove DataGraft can deal with it.

Review #3
By Wouter Beek submitted on 07/Sep/2016
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

I would like to thank the authors for clear answers to almost all of
my questions and concerns! For me only two points remain. I believe
these should be easy to address by the authors in a minor revision:

- I was unable to find the definitions of terms like ‘cleaning’ and
‘preparing’, etc. in the revised paper.

- IIUC then the authors claim that quantitative evaluation of the
platform requires the construction of a benchmark for
cloud-services. While I do see the benefits of such a benchmark,
e.g., making it easy to compare systems with one another, I do not
believe that in absence of such a benchmark researchers should not
quantitatively evaluate the systems they build at all. For
instance, the number of concurrent users per node, the number and
size of transformations that are supported within a given time
unit, etc. are all things that can be measured without a
benchmark. The authors must have these, or very similar, numbers
because they are needed to configure the load balancing and other
properties of the cloud-hosted solution. It may also be possible
to mention how many users are currently using the system. This at
least gives an inkling of the viability of DataGraft as a
sufficiently scalable multi-user platform.

Thank you for clarifying the impact / external (re)use of the
DataGraft platform. IIUC the impact ATM mainly concerns reuse of
various software components that constitute DataGraft, but there are
also indications that the platform is being used outside of the
original development context.

The distinction between web- and cloud-hosted as well as the
distinction between cloud-based and cloud-hosted solutions was not so
clear to me before. This clarifies the delta WRT the LOD2 stack very
well.

Thanks for clarifying the benefit of pure / side effect-less functions
for data transformations. I understand now some of the claims

BTW, I'm impressed by the ability of the DataGraft platform to
clean/transform the ‘volkstellingen’ CSV.