Linked Data Wrapping as a Service

Tracking #: 1509-2721

Authors: 
Iker Azpeitia
Jon Iturrioz
Oscar Díaz

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
Abstract: 
Platform as a service (PaaS) allows customers to develop, run, and manage applications without being involved in maintaining the associated infrastructure. We believe Linked Data Wrappers (LDWs) might well benefit from this kind of approach. LDWs have been proposed to integrate Open APIs into the Linked Data Cloud. Frequently, LDWs are developed in-house and its lifetime is that of the containing application. We advocate for LDWs to be externalized into third-party platforms. Besides the PaaS benefits, this approach addresses one of LDWs’ main hurdles: maintenance. API upgrades are LDWs’ Damocles sword. A LDW platform might well offer support for collaborative LDW definition and maintenance, extending their lifetime beyond their breakout applications, and in so doing, becoming a sound Semantic layer on top of existing Open APIs. This paper contributes to this vision through: (1) identifying a set of requirements for LDW platforms; (2) instantiating these requirements for Yahoo’s YQL; and (3) validating the extent to which this approach accounts for collaborative maintenance.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By jacek.kopecky submitted on 18/Jan/2017
Suggestion:
Major Revision
Review Comment:

The paper presents SYQL, a platform for linked data wrappers, built on top of Yahoo YQL and Open Data Tables. The paper's contribution is a) a set of requirements for linked-data wrapper platforms, b) a proof-of-concept implementation, and c) an evaluation of the platform's effectiveness.

The work is motivated well: linked data wrappers (LDW), as translators between API data sources and RDF, are an important component of the Web of Data, and the paper observes a number of issues common with LDWs.

The paper has several weaknesses that need to be addressed: i) requirements and their evaluation, ii) inclusion of non-archival material, and iii) various aspects of editorial quality.

First, the paper lists 7 requirements for LDW platforms, but it doesn't discuss how those requirements were arrived at – what is the source of the requirements, and what methodology was followed in eliciting them. The Requirements section should also quickly introduce the whole list of the requirements and then, perhaps, discuss them. The paper should include an evaluation of the requirements, based on lessons learned from building your proof-of-concept platform SYQL.

A crucial requirement is that LDWs should be provided as services, not as by-products of application development. However, the authors ignore the economics of the situation: a service needs to make business sense or be run as a charity (such as Wikipedia). In many applications, the effort of externalizing the LDWs is not justified. The authors repeatedly claim that they don't expect altruism, but they do not point to any mechanism that wouldn't require it.

Secondly, the authors should consider *when* the paper will be read: initially, it will serve to let its readers know that you are working on the SYQL system; but later, it will mainly be a reference for the lasting contributions to human knowledge: the set of requirements (if it is methodically produced), your evaluation of the requirements, and the lessons learned from your proof-of-concept work. Conferences are a good venue for current work that may be obsolete when the next technology comes out, but journals are better for lasting contributions and generalizable results.

From this point of view, you should remove all the "demo time" content, and ruthlessly remove text about implementation details aimed at your potential (short term) users. You may include footnotes with pointers to online materials (demos, tutorials) and perhaps an online appendix about the technical details.

Thirdly, a good paper does not only require solid content, but also a good level of polish: readability, clarity, and accuracy. Below are selected suggestions for improving these qualities.

* The paper needs to use better grammar. Please seek help with that.
* It needs to improve the references (e.g. the word "strony" in [68] and elsewhere is entirely inappropriate, [18] needs authors, [45] needs more bibliographic information, and these comments apply to many of the references)
* Figures need attention: often they try to show too much and that harms their usefulness. E.g. Figures 2, 4 and 16 show a sea of code that doesn't truly illustrate the point. Figures 5, 6 and 12 try to cram too many boxes into a small space.
* Section 2 is not entirely useful: a typical reader of this article should know this material. It might be useful simply to list the terms used further in the paper, with references for more information about them.
* Section 3 can be part of the introduction as it motivates and defines the problem addressed by the work.
* The paper feels repetitive at times; e.g. the quote from [24] just before MR3 on page 5 is already used in the introduction.
* MR5 is not defined very well: the description for "allow for resource lookup" is mostly about API keys, and a bit about provenance.
* In section 5.3, it is not clear why we would assign weather reports to pictures; indeed the picture should be about a place, and a place could have a weather report, and this information should be discovered from the LOD cloud, not within the scope of a single LDW.
* The paper should explore the "unskilled target audience" angle mentioned in section 7.2. What are the benefits of semantic technologies to an unskilled audience, and how can such an audience really help contribute?
* The paper should explain how the platform can ensure backwards compatibility in LDW evolution.
* Section 12 should summarise the results in [43].
* Section 14 could use a reminder of what the requirements are, and it should explain why it skips requirements 1 and 2.
* The paper seems to conflate PaaS with running LDWs as a service. Requirement MR1 is about LDWs, while the text before, on PaaS, is about platforms being provided as a service. An app built on PaaS does not itself have to be seen as a service.
* "taping" -> "tapping"
* "petitions" -> "requests"

Altogether, the paper may contain a solid contribution, it just needs to bring it out and make sure it is consistent and well grounded.

Review #2
By Mohsen Taheriyan submitted on 21/Jan/2017
Suggestion:
Minor Revision
Review Comment:

Summary:

The paper introduces a framework to build Linked APIs (referred to as Linked Data Wrappers, LDWs) by transforming the XML/JSON output of REST APIs into RDF. The generated LDWs are very useful in Semantic Web applications; they also contribute to the Linked Data Cloud (LOD) by linking the output triples to other resources on LOD. The paper first discusses the requirements for an LDW platform and then explains how the Yahoo YQL console can be extended to build an LDW framework meeting such requirements. The contribution is a collaborative LDW framework that provides the means to easily build and maintain LDWs. The authors evaluated the proposed framework by performing a user study to see how efficiently users can build/change LDWs. They also compared the new platform with several other tools to build linked APIs.

Detailed Comments:

The paper is well written, clearly states the role of LDWs in the Semantic Web and requirements of a good LDW platform, and articulates the details of the introduced platform supported by example and demo.

Employing the YQL console to define, use, and maintain LDWs is an interesting idea. YQL is a well-known platform and has a relatively large user base. I agree with the authors that the maintainability of LDWs is one of the main obstacles in expanding their usage. However, I think the more important reason is the burden of defining the lowering and lifting schemas/rules, and I believe this is why LDWs have not been widely accepted by API developers (even in the SW community). Although the approach includes a nice plugin to semi-automatically do the lifting by annotating the JSON key values, it seems to me that users still need some expertise in SW technologies to work with the tool. Some tools adopt a better approach to generate the lowering and lifting instructions. For example, Karma provides a graphical user interface allowing the users to map the API data to an arbitrary ontology and then automatically creates lowering/lifting rules.

The proposed lifting plugin includes: resource definition, property mapping, association mapping, and nested mapping. This seems to be sufficient for mapping most of the API outputs. However, if I understand correctly, it does not cover certain cases. For instance, let’s assume that instead of the JSON photo : { id: 1, owner: {name: john}} the API returns the following structure: photo { id: 1, ownerName: john} (see Fig. 12). I don’t see how we can create the same mapping for this data. If there are limitations like this, please clarify this in the paper with an example.
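
To make the two payload shapes concrete, here is a minimal Python sketch. It is not the paper's lifting plugin; the function names and the choice of schema.org terms (Photograph, author, Person, name) are assumptions made only for illustration. It shows that producing the same graph from the flat shape requires the mapping to synthesize an owner resource that has no counterpart object in the input.

```python
from typing import Any, Dict

SCHEMA = "http://schema.org/"  # assumed target vocabulary, for illustration only


def lift_nested(photo: Dict[str, Any]) -> Dict[str, Any]:
    """Lift {id, owner: {name}}: the owner object maps directly to a resource."""
    owner = photo["owner"]
    return {
        "@id": f"_:photo{photo['id']}",
        "@type": SCHEMA + "Photograph",
        SCHEMA + "author": {
            "@type": SCHEMA + "Person",
            SCHEMA + "name": owner["name"],
        },
    }


def lift_flat(photo: Dict[str, Any]) -> Dict[str, Any]:
    """Lift {id, ownerName}: the Person node must be created by the mapping,
    because the input has a scalar key where the nested shape has an object."""
    return {
        "@id": f"_:photo{photo['id']}",
        "@type": SCHEMA + "Photograph",
        SCHEMA + "author": {
            "@type": SCHEMA + "Person",
            SCHEMA + "name": photo["ownerName"],
        },
    }


nested_input = {"id": 1, "owner": {"name": "john"}}
flat_input = {"id": 1, "ownerName": "john"}

# Both shapes should yield the same JSON-LD-style node.
same_graph = lift_nested(nested_input) == lift_flat(flat_input)
```

Under these assumptions, the open question is whether the plugin's annotation of JSON keys can express the second function, where one scalar key gives rise to both an association and a property on a new resource.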

The paper focuses on wrapping REST APIs whose inputs are given as part of the invocation URIs. What about APIs that take XML or JSON as input (using HTTP POST)? Do you have any idea of how to extend the lowering part of the platform to support such APIs? It is worth mentioning this in the Discussion section.

One of the areas that I think needs more thought is the LDW lookup (section 10). The authors suggest using VoID to provide a semantic description of LDWs; nonetheless, I am not sure how much these descriptions help other people find the API they are looking for. Suppose that a developer needs to look up the zipcode from the latitude and longitude in her application. How can she find an LDW with this functionality? The proposed LDW platform does not support fine-grained LDW discovery. Some other approaches such as iServe (https://www.programmableweb.com/api/iserve) and Karma annotate the API inputs and outputs (and also the relationships), enabling users to find the desired APIs by writing SPARQL queries.

Adding a quality checker in the developed LDW platform is very useful and one of the strengths of the tool compared to other existing frameworks. The paper has done an excellent job in categorizing requirements of LDW platforms and comparing the suggested framework with other existing ones with regard to those criteria. The user study is small in scale (only 10 graduate students), and I think it needs to be extended to include more users with different roles (semantic web experts, domain experts, average internet users, etc.) and then compare the efficiency of using the tool for each category. Yet, it conveys the idea that users would be able to engage with the developed LDW platform easily (after some training) to create and maintain LDWs.

Review #3
By Thomas Steiner submitted on 21/Jan/2017
Suggestion:
Reject
Review Comment:

In this paper, the authors describe the concept of Linked Data Wrappers "as a Service", powered by the proprietary technology Yahoo Query Language (YQL). The core problem their approach promises to solve is Linked Data Wrappers becoming outdated and dysfunctional, especially when lifting and lowering steps are involved, potentially paired with (non-semantic) API requirements like authentication keys. The paper has been previously rejected in shorter form, as openly disclosed by the authors in the cover letter.

In a way, the work feels like a walk down memory lane back to 2009, when YQL was still new and Flickr the most popular photo sharing site. In the context of 2017, none of the ideas the paper introduces sound fresh; over the years, they have been described in one form or another at dedicated events like the LDOW workshop series (http://events.linkeddata.org/ldow2017/).

YQL itself has strict rate limits (Per application limit [identified by your Access Key]: 100,000 calls per day. Per IP limits: /v1/public/*: 2,000 calls per hour; /v1/yql/*: 20,000 calls per hour, https://developer.yahoo.com/yql/) that constrain the concept of LDW as a Service severely, as YQL cannot be scaled without limits, a fact the authors do not discuss in depth.

LDWs need to be kept in sync with the APIs they wrap; otherwise, if these APIs change, LDWs tend to stop working. The authors propose that collaborative development on a common platform and an open community might offer a way to move LDW development from in-house development to collaborative development as promoted by YQL. Apart from the open scaling issues, another question arises around the workflow model of people maintaining LDWs and potentially causing conflicts (e.g., by disagreeing on whether each video tag should be a "schema:about" value [https://github.com/onekin/ldw/blob/master/flickr/flickr.videoobject.xml#...], especially when there is no associated DBpedia resource).

In conclusion, this paper unfortunately comes a couple of years too late and no longer convinces in 2017. Even if one tried to abstract away from the authors' concrete technology choice of YQL and came up with a generic LDWaaS (LDW "as a Service") infrastructure, this would not have been new, for example when compared to LOD Laundromat (lodlaundromat.org). The reviewer suggests that the paper be rejected for SWJ. The authors might still want to submit to LDOW 2017 (http://events.linkeddata.org/ldow2017/#submissions); it may still be of interest to practitioners in this community, especially as the authors have working code to show.

* Some more observations:

- "As an example, consider the Flickr API. This Open API facilitates programmatic access to user pictures." => "Open API" needs a definition of the kind of openness, e.g., the Flickr API requires an API key.

- Section 4, first paragraph uses MR{1, 2, 3,…} before defining them. What does MR in Table 1 stand for? Meta(requirement)?

- "LDWs must be registered on the Platform for the LDW’s URIs to be dereferenceable" => What platform? This becomes clearer towards the end, but initially the reader is surprised.

- "Recently, Google, Yahoo and Bing join forces to provide a standard for Rich Snippets, i.e. a JSON-LD formatted sample of a site’s content on Search Engine Result Pages" => The search engines agreed on the joint vocabulary Schema.org, not on Rich Snippets as one possible way to represent structured data.

- "Demo time", "Let’s check this out" => Too colloquial language.

* Typos (incomplete list):

- LDWs are developed in-house and its lifetime is that of the containing application. => LDWs are developed in-house and their lifetime is that of the containing application.
- A LDW => An LDW (multiple occurrences)
- being JSON-LD the preferable RDF format => JSON-LD being the preferable RDF format (not sure if the JSON-LD spec is the most objective source for this claim)
- Deployment is handle => Deployment is handled
- LDWs are seldom used => LDWs are seldomly used
- API requests and its optimization => API requests and their optimization
- they might require to get first an API key. => they might require to first get an API key.
- LDW fragility is a main risk for applications wanted to build a sound Semantic layer => LDW fragility is a main risk for applications wanting to build a sound Semantic layer
- hope that new people is involved in LDW maintenance => hope that new people are involved in LDW maintenance
- what would it be the offerings of this dedicated platform => what would be the offerings of this dedicated platform
- This accounts for more declarative wrapper specifications that easy user involvement => This accounts for more declarative wrapper specifications that ease user involvement
- wrapping is specified through R2RML ontology => wrapping is specified through the R2RML ontology
- A LD => An LD (multiple occurrences)
- The question is to extent => The question is to extend

* Style (incomplete list):
- https://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo...{itemNumber} overflows
- [18,26] => [18, 26] (all occurrences of grouped citations)
- Consistently either use "Yahoo" or "Yahoo!", not both

Review #4
Anonymous submitted on 25/Jan/2017
Suggestion:
Minor Revision
Review Comment:

In this article, the authors present a platform to develop and maintain Linked Data Wrappers (LDW) on top of YQL. The discourse is easy to follow, full of examples and supported by the Design Science principles. However, the structure would be leaner if some sections could be grouped together. I suggest that the authors merge sections 2 and 3, include sections 7 to 11 as subsections of a broader section such as 6, and include section 14 as a subsection of the related work.

I find the authors' proposal very promising, offering a working PaaS to effectively simplify LDW maintenance. Authors successfully make their case with several examples and use cases, showing the applicability and relevance for the LD community. The available implementation is also a proof of its usefulness. However, there are certain aspects that should be improved in the article.

In particular, the requirements section contains some parts that are not clear enough. First, there is no clear reason to use the term "meta-requirements" instead of just "requirements". Second, even though the authors mention API changes as one of the main factors in LDW fragility, this factor is mentioned neither in Table 1 nor in the subsequent text. In fact, it may be one of the most important contributors to LDW fragility. The authors should analyze to what extent this exacerbating factor contributes to LDW fragility, and consequently which requirements are related to solving it. Table 2 should also be clarified, since it shows concrete examples of LDW approaches and not only dimensions.

Regarding the proposal, it would be good if the implicit workflow of an LDW platform like the one proposed were made explicit, as an additional figure or at least by including it in figure 3, for instance. Another aspect that is not sufficiently covered is how the authors' proposal is collaborative. It seems that the authors use git repositories to develop LDWs, but this important aspect for the maintenance of such a platform is not well explained.

Finally, the extension mechanism used to define URI patterns in an ODT does not seem to be optimal. Why was the sampleQuery element chosen? Would it be possible to add new elements such as and instead? The authors should clarify why they chose this option, and analyze other alternatives.

Additional minor issues:
- Please take into account overflown lines, especially in pages 3 and 18.
- A citation would help to support the claim on Open APIs scenarios on page 3, end of second paragraph.
- Page 3: When describing the lifting task, please explicitly refer to Figure 2 when pointing out the property-mapping arrow.
- In some parts of the article the language is a little informal, such as the "demo time" parts of sections 7-9. Please try to use more formal and consistent language throughout the article.
- Page 9: "permits navigate and explore" should be "permits the navigation and exploration" or adding a "to" before verbs.
- Page 15, URIExample paragraph: Instead of "For the same reason as aforementioned" please use "For the same aforementioned reason".
- Section 8: Authors should redirect the reader to section 11 when mentioning the quality check that is performed when registering an LDW
- Some references use the word "strony", which I believe is Polish, when listing the pages of each cited article.

Review #5
By Ruben Verborgh submitted on 26/Jan/2017
Suggestion:
Major Revision
Review Comment:

Note: this is an additional review by the editor of this paper, in which I aim to clarify what I see as the major flaws of this work. For this reason, I will only zoom in on specific aspects.

This paper lacks a clear problem statement, and an evaluation that proves the proposed solution indeed adequately addresses that problem. While there is a Section 3 titled “the problem”, the concept of “LDW fragility” is never explained in sufficient detail. The abstract, in contrast, lists “maintenance” as a main hurdle. So the assumption seems to be that the fragility of LDWs is due to the absence of maintenance. However, an additional assumption is that facilitating the maintenance process will indeed make LDWs more maintained and hence less fragile. This neglects other potential factors, such as simply lack of interest, usage, or the existence of other approaches besides LDWs. But even if we follow those assumptions, the evaluation does not allow us to conclude that the proposed platform makes maintenance easier (note the comparative form of the word). We only have a measurement in minutes for several participants, no comparison with other tools. So is, for instance, 3.2 minutes to change an API key better or worse than what we have? And will this (unstated) difference in time indeed contribute to higher maintenance, and thus less fragile API wrappers? The authors state in their conclusion that it “has yet to be proven” whether this is effectively the case, thereby indicating that indeed, their main claim has not been proven. As such, either the main claim needs to be changed, or the evaluation needs to be updated.

In any case, this paper needs:
– exact definitions of the concepts and the problem;
– an evaluation that compares different solutions;
– a conclusion that is supported by the evaluation.

A problem with the evaluation is the interpretation by the authors. The numbers of Table 5 are interpreted, attempting to justify numbers by “first contact” and “increasing familiarization”. None of this is certain however, and could be controlled by, for instance, a randomization of the order. Due to the lack of comparison, we don't know whether the task would have been easier without SYQL: maybe just writing hand-wired integrations is even faster. Or maybe another tool makes it easier. We just don't know; the Related Work section aims to compare, but only in a qualitative way, and insufficiently to prove the main point.
I think the claims you make can only be validated by having the platform run on the public Web for several months and seeing how maintenance evolves. So either the claims need to be adjusted, or the actual claims have to be validated.

An overarching problem is that the use of LDWs is seen as self-evident, whereas I don't think this is the case. As such, the need for LDWs should be argued, as well as their relevance on today's Web. It seems to me that LDWs were introduced as a means to facilitate interactions with constantly changing APIs—and given their fragility, they are not doing a great job. Do we need LDWs at all?

The main problems of this paper are strongly reflected in its abstract. The authors “believe” that LDWs need PaaS (no evidence). Why would APIs need to be integrated in the Linked Data Cloud (not self-evident)? An LDW platform “might” indeed support this (not proven). “Collaborative maintenance” is not validated.

The argument in Section 3 is also broken: an argument from consequences is rather dangerous. It seems evident that, if LDWs break, the Linked Data they deliver breaks. It seems evident that their applications break. That by itself, however, is not an argument that LDWs need to be maintained; perhaps LDWs are just not the right solution for today's Web. I also have strong doubts regarding the scalability of the LDW approach, if every single API would need to be wrapped by an additional server.