Making the Web a Data Washing Machine - Creating Knowledge out of Interlinked Data

Paper Title: 
Making the Web a Data Washing Machine - Creating Knowledge out of Interlinked Data
Authors: 
Sören Auer and Jens Lehmann
Abstract: 
Over the past 3 years, the semantic web activity has gained momentum with the widespread publishing of structured data as RDF. The Linked Data paradigm has therefore evolved from a practical research idea into a very promising candidate for addressing one of the biggest challenges in the area of the Semantic Web vision: the exploitation of the Web as a platform for data and information integration. To translate this initial success into a world-scale reality, a number of research challenges need to be addressed: the performance gap between relational and RDF data management has to be closed, coherence and quality of data published on the Web have to be improved, provenance and trust on the Linked Data Web must be established and generally the entrance barrier for data publishers and users has to be lowered. In this vision statement we discuss these challenges and argue that research approaches tackling these challenges should be integrated into a mutual refinement cycle. We also present two crucial use-cases for the widespread adoption of linked data.
Submission type: 
Other
Responsible editor: 
Krzysztof Janowicz
Decision/Status: 
Accept
Reviews: 

Review 1 by Claudia d'Amato:

The paper analyzes a set of research challenges to make the initial success of the Linked Data paradigm a world-scale reality and suggests several research approaches to cope with these challenges.

In the following some detailed comments are reported:
* end of sect. 1: an example could be added
* beginning of sect. 2: the idea of integrating schema mapping and data interlinking algorithm
should be extended and particularly the role of schema mapping should be specified
* is it possible to sketch some proposals for solving the three challenges listed at the end of sect. 2?
* sect. 3: Machine Learning methods are usually grounded on the closed world assumption, unlike
the semantic web setting, where the open world assumption is adopted. Some comments on the
possible customization that this difference would require could be interesting
* beginning of sect. 4: "For interlinking and fusing as well as for the classification,
structure..." -> please make explicit the classification of what
* sect. 5, 3rd point in the list: what is the goal of applying machine learning techniques? What
has to be refined?
* sect. 6: "The main issues of integration are the use of different identifiers for the same thing
and diversity in units of measure" -> is it possible to sketch some solutions to this problem?
* end of 1st column pp. 4: please motivate why centralized and top-down approaches are not adequate
for European governments and public administrations

MINOR:
* beginning of 2nd paragraph sect. 2: "The value of a knowledge base" -> "The usefulness of a
knowledge base"
* end of 2nd column pp. 2: "On the Data Web users are not" -> "On the Data Web, users are not"
* beginning of sect. 6: "Enterprise information integration...need for integration" -> this sentence
should be rephrased
* middle of sect. 6: "Classification, application of...disruption to infrastructure" -> this
sentence should be rephrased
* end of 2nd column: "references from the the Data Web" -> "references from the Data Web"
* middle of sect. 7, 1st column: "..be it supra-national..." -> "be" does not seem to be correct
* end of 1st column pp. 4: "of Europe this will be a very challenging due to" -> "of Europe this
will be very challenging due to"

Review 2 by Rinke Hoekstra:
The author describes a vision of the data web where multiple different challenges interact to improve the quality and interlinkage of the data available. This is a nice analysis of the problems facing current linked data and semantic web research, and the idea of the web as a 'Washing Machine' for linked data is very appealing. Perhaps that should be in the title?

One thing that is mentioned in the paper is provenance. This makes me wonder whether the author should include the more technical challenge of being able to represent provenance-related information in a transparent fashion on the linked data web. I can imagine there are several more core technical challenges to overcome before this is a reality.

The second half of the paper tries to explain how this paradigm can be implemented in several use cases, service oriented architectures (deployment of linked data on intranets) and for government data. This part of the paper is slightly less well developed, and could be a bit more to the point as to how the washing machine metaphor can improve the application of the linked data approach in these areas.

There are some typos and minor issues:
* p.3. 'needs continue to grow, mergers' -> 'needs to continue to grow. Mergers'
* p.3. The statement about doubling warehouse sizes could use a reference
* p.3. 'entail substantial' -> 'entails substantial'


Comments

The work tries to identify ways to alleviate the shortcomings of the Linked Open Data (LOD) Cloud to make it practically useful. I am personally happy to see this work, which tries to identify the shortcomings of the LOD Cloud and suggests ways in which it can be improved and used for practical purposes. The work systematically identifies four major issues plaguing the LOD Cloud and explains how each of them affects the usability of the cloud. While it’s a laudable effort, I find it a bit difficult to follow some of the arguments for the following reasons.
(1) Isn’t the issue with the performance of large-scale RDF data management due more to the slow performance of the query language (SPARQL) than to the storage systems? Most of the well-known and widely used RDF storage systems, such as Virtuoso [2] and Oracle RDF [3], have an RDBMS backend that handles indexing and other data management issues. Hence, I personally think the difference in performance is due less to indexing techniques and more to the query language. Some of these systems translate the SPARQL query into SQL before executing it, which adds processing time. Furthermore, a query over graphs naturally takes more time than one over a relational table. It would be very interesting if the author could shed some light on the possibly related issue of RDB-to-RDF mapping, which is widely used for exporting data to the LOD cloud, and on whether it adds to the issue of poor data modeling and quality.
(2) Perhaps the author could point to some of the issues summarized by Jain et al. in [1]. That work is of a similar nature and aims at identifying the shortcomings of the LOD Cloud and their effects on LOD usage.
(3) Adding a few examples that establish the fallacies pointed out by the author would make the paper easier to follow and would convey the seriousness of the issues. Additionally, some government datasets are already available for download through the LOD Cloud and, to a large extent, use common identifiers such as SKOS and FOAF. Information about the European Union is already available as part of the LOD Cloud [4]. Similarly, other government-related datasets are also available [5,6,7]. Perhaps the author could give pointers to these datasets.
(4) The idea of decentralized registries is in a way being pursued by the SPARQL WG through the notion of “Service Description” [8]. Additionally, a similar idea pursued in the past, in the form of UDDI for Web Services, did not prove to be a successful and workable solution. It would be interesting if the author could give his views on how such registries can serve the purpose for LOD datasets.
(5) I completely agree with the notion of public involvement in the creation and consumption of the LOD cloud to make it truly a Web of Data (just as the Web of Documents is ruled by user-generated content). On the other hand, I think a problem with the LOD Cloud also stems from the fact that anyone and everyone is creating data and adding it to the cloud without giving any thought to how it should be modeled properly (e.g. misuse of owl:sameAs). Perhaps the solution lies in extending the Linked Data Design Principles with specific pointers related to “quality control” of the data.
(6) In point 2, the author talks about a “shortage of links” on the LOD Cloud. But isn’t the bigger issue on LOD the quality of links rather than their quantity? There already exists work on the automatic creation of links [9]. Perhaps the path to LOD nirvana lies in point 3 mentioned by the author.
(7) It might help if the author could motivate Section 6 with an example. It could compare and contrast how the current state of LOD prevents corporations from using it, whereas if the quality of LOD were improved, corporations could put it to better use.
(8) The citations are incomplete.
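
To make the SPARQL-to-SQL translation raised in point (1) concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It assumes a simplified single triple table with columns `s`, `p`, `o`; the table layout, the toy data and the function name `friend_names` are all hypothetical and do not describe how Virtuoso or Oracle actually store or translate queries. It only shows that each triple pattern becomes another self-join on the same table.

```python
import sqlite3

# Hypothetical toy data in a single three-column triple table
# (a simplification; real stores use more elaborate layouts).
TRIPLES = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:carol", "foaf:name", "Carol"),
]

# SPARQL pattern being illustrated (not executed here):
#   SELECT ?person ?friendName WHERE {
#     ?person foaf:knows ?friend .
#     ?friend foaf:name  ?friendName .
#   }
# Each triple pattern maps to one alias of the triple table, and the
# shared variable ?friend becomes the join condition t2.s = t1.o.
SQL = """
SELECT t1.s AS person, t2.o AS friend_name
FROM triples AS t1
JOIN triples AS t2 ON t2.s = t1.o
WHERE t1.p = 'foaf:knows' AND t2.p = 'foaf:name'
ORDER BY person
"""


def friend_names():
    """Load the toy triples into SQLite and run the translated query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", TRIPLES)
    rows = conn.execute(SQL).fetchall()
    conn.close()
    return rows
```

A graph pattern with n triple patterns would, under this naive mapping, need n aliases of the same table, which is one way to read the concern that graph queries cost more than queries over a purpose-built relational table.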

References
[1] Prateek Jain, Pascal Hitzler, Peter Z. Yeh, Kunal Verma, and Amit P. Sheth, Linked Data Is Merely More Data. In: Dan Brickley, Vinay K. Chaudhri, Harry Halpin, and Deborah McGuinness: Linked Data Meets Artificial Intelligence. Technical Report SS-10-07, AAAI Press, Menlo Park, California, 2010, pp. 82-86. ISBN 978-1-57735-461-1 http://knoesis.wright.edu/library/publications/linkedai2010_submission_1...
[2] http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTriple
[3] http://www.oracle.com/technology/tech/semantic_technologies/pdf/semantic...
[4] http://www4.wiwiss.fu-berlin.de/eurostat/
[5] http://riese.joanneum.at/
[6] http://www.rdfabout.com/demo/census/
[7] http://www.govtrack.us/
[8] http://www.w3.org/TR/2009/WD-sparql11-service-description-20091022/
[9] J. Volz, C. Bizer, M. Gaedke, G. Kobilarov, Silk - A Link Discovery Framework for the Web of Data, in: Proceedings of the 2nd Linked Data on the Web Workshop, 2009.

Prateek, thanks a lot for your valuable comments! Unfortunately, the four pages given to me by the editors won't suffice to discuss these at length.
While I agree with most of your comments, I disagree with your opinion about the slow performance of SPARQL. I'm absolutely convinced that SPARQL querying can be substantially accelerated. The indexing and caching strategies used by current triple stores were mostly developed for relational databases, and they are not very suitable for RDF data management. On the one hand, you are right: since we trade performance for flexibility with a triple store, it is inherently slower. On the other hand, we currently have unused opportunities to rearrange RDF data in ways that improve querying for commonly used information structures. If we explore this direction, e.g. with caching and materialized view maintenance strategies for RDF data, we can probably reduce the performance gap between RDF and relational data management.
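
As a rough illustration of what "materialized view maintenance" for RDF data could mean, the following Python sketch keeps a generic triple list alongside a view that groups selected predicates by subject and is updated at insert time. This is an assumption-laden toy, not the authors' method or any existing triple store's design; the class `TinyTripleStore`, its methods and the example predicates are hypothetical.

```python
from collections import defaultdict


class TinyTripleStore:
    """Toy in-memory store with one incrementally maintained view."""

    def __init__(self, view_predicates):
        self.triples = []  # generic storage: a flat list of (s, p, o)
        self.view_predicates = set(view_predicates)
        # Materialized view: subject -> {predicate: [objects]}
        self.view = defaultdict(lambda: defaultdict(list))

    def add(self, s, p, o):
        """Insert a triple and keep the view in sync (view maintenance)."""
        self.triples.append((s, p, o))
        if p in self.view_predicates:
            self.view[s][p].append(o)

    def describe_scan(self, s):
        """Answer from the generic triple list by scanning every triple."""
        result = defaultdict(list)
        for ts, tp, to in self.triples:
            if ts == s and tp in self.view_predicates:
                result[tp].append(to)
        return dict(result)

    def describe_view(self, s):
        """Answer the same question with a direct lookup in the view."""
        entry = self.view.get(s, {})  # .get avoids creating empty entries
        return {p: list(objs) for p, objs in entry.items()}


store = TinyTripleStore(["foaf:name", "foaf:homepage"])
store.add("ex:alice", "foaf:name", "Alice")
store.add("ex:alice", "foaf:knows", "ex:bob")
store.add("ex:alice", "foaf:homepage", "http://example.org/alice")
store.add("ex:bob", "foaf:name", "Bob")
```

Both `describe_scan` and `describe_view` return the same answer; the view trades extra work on every insert for a lookup that no longer depends on the total number of triples, which is the kind of rearrangement for commonly used information structures that the reply alludes to.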

Just realising that we partially touch upon similar issues. Soeren, you may want to check our article (final version just submitted):

http://www.semantic-web-journal.net/content/new-submission-can-we-ever-c...

(not sure whether it's worthwhile to mutually cross-reference, or whether I can still add that, will ask the Editors...)