Linked Data Wrapper Curation: A Platform Perspective

Tracking #: 1659-2871

Iker Azpeitia
Jon Iturrioz
Oscar Díaz

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
Linked Data Wrappers (LDWs) turn Web APIs into RDF end-points, leveraging the LOD cloud with current data. This potential is frequently undervalued, regarding LDWs as mere by-products of larger endeavors, e.g. developing mashup applications. However, LDWs are mainly data-driven, not contaminated by application semantics, hence with an important potential for reuse. If LDWs could be decoupled from their breakout projects, this would increase the chances of LDWs becoming truly RDF end-points. But this vision is still under threat by LDW fragility upon API upgrades, and the risk of unmaintained LDWs. LDW curation might help. Similar to dataset curation, LDW curation aims to clean up datasets but, in this case, the dataset is implicitly described by the LDW definition, and “stains” are not limited to those related with the dataset quality but also include those related to the underlying API. This requires the existence of LDW platforms that leverage existing code repositories with additional functionalities that cater for LDW definition, deployment and curation. This paper contributes to this vision through: (1) identifying a set of requirements for LDW platforms; (2) instantiating these requirements for Yahoo’s YQL; and (3), validating the extent to which this approach facilitates LDW curation.

Reject (Two Strikes)

Solicited Reviews:
Review #1
By Mohsen Taheriyan submitted on 07/Jun/2017
Review Comment:

The revised draft greatly improves on the original version. The authors have addressed most of my comments and I am satisfied with the current version. I think the authors have also adequately addressed most of the concerns raised by the other reviewers, and I would like to recommend this paper for publication.

Review #2
Anonymous submitted on 13/Jun/2017
Minor Revision
Review Comment:

I acknowledge the effort that the authors have put into this revision. The refocusing of the paper towards curation is more appealing, and the requirements are clearer and more useful for comparison with similar approaches than in the previous version. The authors have also successfully tackled my comments, though I still have some minor issues.

Regarding the re-organization, though it improves on the previous version, I still find that it should be further improved before the paper can be accepted. First, section 2 (especially 2.2) seems to contain the background and state-of-the-art discussion on LDWs, but the section title does not really convey this. Then, the title of section 3 should also be modified to better sum up its content: it is actually a good motivation, presenting compelling use cases. Hence, I suggest that the authors include some of these keywords ("motivation", "use case"...) in the title. Similarly, the wording of the titles of sections 5 and 9.1 should be revised. The former should put the focus on what is described (i.e. the architecture), and the latter sounds a bit informal/flashy ("Experimental Results" or similar may be better). Finally, I still find it advisable to group sections 5 to 8 together in an overarching section which actually presents the authors' proposal. Similarly, section 10 should be included as a subsection of the evaluation section (9).

Another concern that has occurred to me is the choice of YQL as the underlying solution for developing LDWs. I am convinced that the infrastructure already in place is useful and interesting for building an LDW platform, but I am not sure about the numbers the authors show to claim that a strong community is behind it. I have found that in the past 4 years there has been little activity in their GitHub repository, so it would be better if the authors could further support the choice of this technology with additional reasons.

Finally, it is not clear to me what the actual differences between LDW discovery and LDW lookup are. In other words, even though the authors state that discovery is not supported, by integrating Hydra and other additional vocabularies (e.g. Linked USDL, OpenAPIs...) to describe LDWs, one could make use of that information to actually discover LDWs that are relevant to one's needs. Please elaborate on this topic.

Additional minor comments:
- Please include section numbers in the outline at the end of the introduction.
- As part of the suggested reorganization, the authors should try to avoid some repetitions that still remain in some passages of the text (e.g. page 9).
- Figure 2 does not highlight the association mapping to DBpedia, though it is noted in the text.
- Figure 7: actors may be better labelled as developers, users or curators.
- Figure 11, line 32: Is this actually an LDW? If so, it should be noted in the text that describes this figure.
- Footnote 6: If this is actually an option that may be supported by SYQL, I suggest that the authors promote this footnote to a normal paragraph and elaborate on it.
- The authors should also proofread the paper to fix some grammar/spelling mistakes and rewrite some obscure/informal sentences:
-- Page 2, section 2, 2nd paragraph: "extent" -> "extend"
-- Page 5, section 3, 1st paragraph: Please rewrite the sentence
-- Page 7, last paragraph of left column: "close the chasm" -> integrate? reconcile?
-- Page 8, 1st paragraph of right column: "The issue:" -> rewrite to use a full sentence. Please note that this construction is used in several places of the paper, too. This style does not feel appropriate to me in a scientific paper.
-- Page 8, section 4, 1st paragraph: "...just to motivate the need" -> the need for what?
-- Page 20, task 2.1 description: "an Property" -> "a Property"
-- Page 20, last sentence before section 9.1: remove it.
-- Page 23, 1st paragraph after the item list: Rewrite the first 2 sentences.
-- Page 23, last sentence: "We would like also to..." -> "We would also like to..."

Review #3
By Milan Dojchinovski2 submitted on 04/Aug/2017
Major Revision
Review Comment:

This paper presents an approach for management of Linked Data Wrappers (LDW).
In particular, the paper identifies requirements for an LDW platform and implements them in the YQL language.
The problems and challenges addressed in the paper are valid and require attention by the community.
These challenges have been already addressed in the past years, however without any significant impact and take-up.

In general the paper is well structured, however, some parts could be better structured (more on that in the comments below).
The paper is difficult to read and follow since:
- the contributions are not clearly listed early in the paper
- the advancement over the state-of-the-art is not clearly described
- the vision (concept) and the actual realization of vision (the platform) are not clearly separated
- it is not clear how the individual components of the platform are realized (sections 6, 7 and 8). There is some description; however, it is difficult to get a clear overall picture of the platform

Some more specific comments related to individual sections:

= Abstract
"Similar to dataset curation, LDW curation aims to clean up datasets but, in this case, the dataset is implicitly described by the LDW definition, and “stains” are not limited to those related with the dataset quality but also include those related to the underlying API. This requires the existence of LDW platforms that leverage existing code repositories with additional functionalities that cater for LDW definition, deployment and curation."
- This part is very unclear. What do you mean by "to clean up datasets"? A Web API is not a dataset: it can expose data, but not only data. Functionalities are (usually) also exposed via a Web API. Please **rephrase/simplify** this sentence.

= Introduction
"Web APIs are an important source of current data."
- Again, Web APIs expose not only data but also functionalities. Please fix this. Or clarify in the paper that you deal with Web APIs which serve data, not functionalities.

- fix the remainder: "Finally, SYQL is confronted with other platforms with related aims. Conclusions end the paper." - link to the appropriate chapters

- if there is proper versioning of the API in place, it can assure that the LDW is valid for that version. See possible options for versioning at
Maybe you can elaborate more on this? Would an LDW work if the underlying API supports versioning?
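
To make the versioning question concrete, here is a minimal sketch of one possible policy, under loud assumptions: the wrapper records the API version it was written against, versions are dotted strings, and only a change of major version breaks the wrapper. The metadata fields and function names are hypothetical and are not YQL or SYQL syntax.

```python
# Hypothetical wrapper metadata; the field names are illustrative only
# and do not come from the paper or from YQL/SYQL.
wrapper = {"name": "photo-ldw", "api_version": "2.3"}


def major(version: str) -> int:
    """Return the major component of a dotted version string."""
    return int(version.split(".")[0])


def is_compatible(wrapper_version: str, live_version: str) -> bool:
    """Assumption: the wrapper stays valid while the major version is unchanged."""
    return major(wrapper_version) == major(live_version)


# Same major version: the wrapper is assumed to still work.
print(is_compatible(wrapper["api_version"], "2.7"))
# Different major version: the wrapper would be flagged for curation.
print(is_compatible(wrapper["api_version"], "3.0"))
```

Whether such a check is sufficient depends on how strictly the API provider follows its own versioning scheme, which is exactly what the authors should discuss.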

- contributions are not clearly described in the introduction. State what the contributions are and briefly describe them so the reader knows what to expect. Currently, after reading the introduction, it is not clear what should be expected in the rest of the paper. Be precise about what your contributions are, for example: a platform for managing Linked Data Wrappers realized by...

- From the introduction and the abstract, it looks like the paper is about "wrapping APIs so that they produce RDF". However, in section 2.2 you list frameworks/tools which have nothing to do with API wrappers, for example DBpedia. Please either focus on wrapping Web APIs, or rewrite the abstract/introduction so that you do not direct the reader's attention only to Web APIs, but to "exposing semantic content on the Web" in general.

= Section 2. The practice: Linked-Data Wrapping

- The title of the section "2.2. LDWs as separated artifacts" does not clearly describe the underlying content. Suggestion: "Existing/Current LDW approaches" - since you primarily review them in Sec 2.2.

- you classify DBpedia as an LDW approach. I don't agree with this since DBpedia is created by "extracting" data, and not by "wrapping" data. Please remove DBpedia as an LDW approach.

- page 4 "DBpedia converts Wikipedia HTML pages," - this claim is not true. The DBpedia extraction framework does not work over HTML pages, but over Wikipedia dumps.

- page 4, "SA-REST focuses on Web Services. But it is the wrapping of REST API’s where more initiatives showed up." - this is very unclear.

- page 4. in the part where you discuss the "Data Sources" dimension you do not mention Karma, LEEDS, SWEET, xCurator,....

- page 5. rename "Tooling." to "Mapping Tools" or something along these lines. "Tooling" is too broad.

- page 5. "DBpedia shares wrappers as wiki pages." - in the DBpedia community terminology "wrappers" are called "mappings". Please use proper terminology.

= Section 3. The case for LDW Platforms

- in this section the authors motivate their work by: 1) explaining the lifecycle of an LDW, and 2) providing motivating scenarios. I would recommend moving 1), the lifecycle of an LDW, to the previous section 2 (it fits better there).
The 2) motivating scenarios are very specific and described in much detail. The scenarios you describe read like use cases of the proposed platform. I recommend that you move these "use cases" to the end, before/after the evaluation section, and use these examples to show how your platform can be utilized.

= Section 4

Requirements are in general OK.

Fix sub-section titles:
- page 8. 4.1.1. Allow for LDW definition -> maybe: "LDW specification"
- page 9. 4.1.2. Allow for LDW deployment -> "LDW deployment"
- page 9. 4.2.2. Allow for LDW lookup -> "LDW lookup"
- page 9. 4.2.3. Allow for resource lookup -> "Resource lookup"
- page 9. 4.3.1. Allow for spotting stains -> "Spotting stains"
- page 9. 4.3.2. Allow for cleaning up stains -> "Cleaning up stains"

= Section 5 A platform for LDW management: architecture

- This section is problematic. From the title one expects that you present an architecture of the LDW platform, but actually in this section you describe (S)YQL and ODT.
- the first sentence "This section fleshes out the LDW platform vision..." - is this vision or reality? From the screenshots, it looks like this is something concrete.
- the role of SYQL for LDWs is not clear; describe the role of YQL in LDWs early in this section

= Section 6. Addressing Producer Requirements

- In the creation of an LDW, lifting is one of the most crucial phases. In Fig. 11, the authors present an example of a specified LDW, where the lifting implementation is also presented. The lifting approach is very naive and does not bring any innovation. It is basically JavaScript code which does the transformation from XML to JSON-LD.
What about using RML for lifting? It's much more elegant.

- can you deal with listings? can you iterate/loop and process (generate RDF) each piece of data in an array?
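
As an illustration of what handling listings would involve, here is a minimal sketch that lifts every element of an array in an API response into its own JSON-LD node. The payload shape, the base URI and the vocabulary mapping are all assumptions made for this example; none of them come from the paper or from Fig. 11.

```python
import json

# Hypothetical API payload: a list of photos, as a wrapped Web API might return it.
api_response = {
    "photos": [
        {"id": "1", "title": "Sunset"},
        {"id": "2", "title": "Harbour"},
    ]
}

# Assumed base URI and vocabulary mapping; neither comes from the paper.
BASE_URI = "http://example.org/photo/"
CONTEXT = {"title": "http://purl.org/dc/terms/title"}


def lift_items(payload: dict) -> dict:
    """Lift every element of the 'photos' array into a JSON-LD node."""
    nodes = [
        {"@id": BASE_URI + item["id"], "title": item["title"]}
        for item in payload.get("photos", [])
    ]
    return {"@context": CONTEXT, "@graph": nodes}


lifted = lift_items(api_response)
print(json.dumps(lifted, indent=2))
```

The point of the question is whether the LDW definition language supports this kind of per-item iteration, and how empty or nested arrays are treated.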

= Section 7. Addressing Consumer Requirements

Fig 14. The sequence diagram does not make sense. The method invoked by the client (application) is "lookup", which logically would perform lookup/search/discovery; however, it is actually used to invoke a method on an actual service.
I would expect:
- first, lookup for a service (service discovery step) - returns one or more service candidates
- second, actual invocation of a chosen service by the application - returns results from the actual execution of a service in RDF
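
The two expected steps can be sketched as follows. This is only an illustration of the separation between discovery and invocation: the in-memory registry, the `Wrapper` class and the keyword matching are hypothetical and do not reflect how SYQL is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Wrapper:
    """Hypothetical description of a registered LDW."""
    name: str
    keywords: List[str]
    invoke: Callable[[str], Dict]


def photo_wrapper(resource_id: str) -> Dict:
    """Stand-in for a real service call; returns a JSON-LD-like dictionary."""
    return {
        "@id": f"http://example.org/photo/{resource_id}",
        "@type": "http://schema.org/Photograph",
    }


# Assumed in-memory registry; a real platform would query a catalogue.
REGISTRY = [Wrapper("photos", ["photo", "image"], photo_wrapper)]


def discover(keyword: str) -> List[Wrapper]:
    """Step 1: discovery returns zero or more candidate wrappers."""
    return [w for w in REGISTRY if keyword in w.keywords]


# Step 2: the application picks a candidate and invokes it.
candidates = discover("photo")
result = candidates[0].invoke("42")
print(result["@id"])
```

Keeping the two steps apart means that a failed discovery (no candidates) is distinguishable from a failed invocation, which the current "lookup" naming hides.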

= Section 8. Addressing Curator Requirements

- this section is well developed and based on relevant previous work on Linked Data Quality assessment.

= Section 9. Evaluation
- the evaluation is well conducted

= Section 10. Comparing SYQL with other platforms
- rename to "Related Work"
- the section is ok.

= Section 11. Conclusion

- you cite "There is no reliable business model to finance the curation and maintenance of data repositories ... Crowdsourcing models are promising in this respect because data producers ensure that the deposited data are accurate and reusable, but these models are still not widely deployed"
... this raises several crucial questions which should be addressed in the paper:
1) how do you plan to gather a crowd?
2) how will you motivate people to create LDW?
3) how will you motivate people to maintain existing LDW?

Another important question which should be addressed in the paper is:
- You propose a centralized approach for the management of LDWs on your platform. How will you deal with the scalability of your solution? What about distributed LDW deployment? Maybe on the client side?

One more crucial issue:
- RML ( is very relevant work which is not mentioned in the paper. Please add it to Table 1 and refer to it in the paper.

Some language issues:
- Section 2., page 3. "web APIs." -> Web APIs - be consistent
- page 4, "the creation time," -> "the creation time interval"
- page 6. "It permits navigate and explore multiple" -> fix the sentence
- page 16 - fix spacing between rows

To sum up: the paper addresses a relevant problem which has already been addressed in the past in several other works. These works, however, did not achieve any significant impact or take-up. The paper is in general well structured, but some major structural updates should be made. The contributions are not clearly described throughout the paper; in other words, it is not clear what has been developed and how. The authors also missed relevant work (RML). The proposed solution is highly dependent on the crowd; however, the authors do not explain how a crowd will be gathered and engaged. Most of the paper is difficult to read and follow: it requires simplification and low-level examples.