XMLSchema2ShEx: Converting XML validation to RDF validation

Tracking #: 1759-2971

Herminio Garcia-Gonzalez
Jose Emilio Labra-Gayo

Responsible editor: Axel Polleres

Submission type: Full Paper

RDF validation is a new topic where the Semantic Web community is focusing attention while in other communities, like XML or databases, data validation and quality was considered a key part of their ecosystem. On the other hand, there is a recent trend to migrate data from different sources to semantic web formats. These transformations and mappings between different technologies come at a price. In order to facilitate this transformation, we propose a set of mappings that can be used to convert from XML Schema to Shape Expressions (ShEx)—one of the recent RDF validation languages—. We also present a prototype that implements a subset of the mappings proposed, and an example application to obtain a ShEx schema from an XML Schema one. We consider that this work and the development of other format mappings could drive to a new era of semantic-aware and interoperable data.

Minor Revision

Solicited Reviews:
Review #1
By Felix Sasaki submitted on 26/Nov/2017
Review Comment:

I am satisfied with the responses to my comments and with the rewrite of the paper, and thank the authors for their efforts.

Review #2
By Emir Muñoz submitted on 09/Dec/2017
Minor Revision
Review Comment:

I appreciate the work done by the authors to address all comments, including my comments and the comments of the other reviewers. Because of all the effort put into this new version, the manuscript is more self-contained and readable, leaving less room for misinterpretation and missing information. However, I still have several minor comments on the new version of the manuscript, which I would like the authors to comment and act on.

Most of the minor comments are still on the writing style and presentation. First, the introduction section still needs some revisions in order to motivate the work done. But the most critical issue that I see is in the related work section. This section pretty much lists others' work but still does not compare and position the present work against existing works. Please note as well that your related work is not limited to "other approaches that convert from XML Schema to ShEx"! You should cover all schema mapping approaches from XML to Semantic Web formats. Also cover all schema languages in XML and RDF: mention their main features, differences, limitations, etc. And you should write this as a continuous text instead of just one sentence per work without real connections among them.

Second, regarding the research focus of the manuscript, I am glad to see that the authors now state their research questions clearly. However, I find the first one too narrow, and it should be revised. Also, the authors should match the rest of the paper, mainly the discussion and conclusions, with these questions to give a good closure to the work. In the end, that is the research contribution made in this work.

In the following, I give my comments per section. (Please use the same format in your response.)

### Abstract

1. “Mappings between different technologies come at a price” --> what do you mean by price here? E.g. runtime, human hours? By explaining this you could improve the motivation and need for such a tool.
2. The em dash is not needed at the end of a sentence.

### Section 1

1. “one of the key areas” --> “a key area”
2. “Reliance” --> what do you mean by reliance in this context?
3. You mention normalization as something desired, but you don’t define its meaning in this context. Please add a small definition for it.
4. The quote from P.N. Fox et al. should be formatted in italics.
5. “than DTDs” --> remove the last comma
6. “With XML Schema, developers ...” --> “Using XML Schema developers” (no comma)
7. “Alongside the appearance of DTDs and XML Schema” --> “Besides DTD and XML Schema”
8. “Other alternatives” --> add “for XML validation”
9. “Unlike XML, RDF” --> “In the Semantic Web, unlike XML, RDF” (add context)
10. Missing a reference to OWL. What does OWL stand for?
11. Missing a reference to RDFS
12. “does with XML [29]” --> “does for XML [29]”
13. Reference [26] should actually go earlier together with [25]
14. You mention SHACL without a reference. What is SHACL, and what is its role? (Consider a reader not familiar with the terminology.)
15. “migration and interoperability needs are nowadays more pressing than before, many authors ...” --> “the need for migration and interoperability to more flexible data formats is nowadays more pressing than before. Many authors ...”
16. “from XML to RDF [20, 8, 1, 5], which have the goal” --> “from XML to RDF [20, 8, 1, 5] with the goal”
17. “a lacking process when converting XML to RDF is validation.” --> “means for validating the output data after converting XML to RDF are missing.”
18. “How to be ensure that” --> “How to ensure that”
19. In the second-to-last paragraph of Page 1, you explain (using i.e.) a how-to question with another how-to question. Try to explain the first question in simple words.
20. “Conversions between XML and RDF and” --> “Conversions between XML and RDF, and” (add comma)
21. Rephrase the following sentence: “With that in mind, providing migrations from in-use technologies to semantic technologies can enhance the migration possibilities.” Please clarify and give an example. Basically, you say “providing migrations … can enhance the migration possibilities”.
22. “in other cases like small companies” --> “for instance small companies”
23. “low budget projects they can make” --> “low budget projects can make”
24. “Taking TEI [10] as an example” --> What is TEI? What is its relevance in the text?
25. “There are a lot of manuscripts” --> “There are many manuscripts”
26. “underlying technology despite they can benefit” --> “underlying technology although they can benefit”
27. “Those are the cases where generic approaches can” --> “Those are the cases where generic approaches, as the one introduced here, can”
28. “a solution and, therefore, automatic conversion of schemata has its space ...” --> it is not clear how the conclusion (part after therefore) can be inferred from the previous sentence. Please clarify that, or rephrase.
29. “With that problem in mind” --> What problem? Please, be more specific and guide the reader.
30. Your first research question: “Is a mapping between XML Schema and ShEx reachable?” --> “Is there a ...” This question is too narrow and can be answered with a “yes” or a “no”. Such type of questions should be avoided. Try formulating a less narrow question in the line of “What components should have a mapping from XML Schema to ShEx?” or something like that.
31. For all your research questions, a connection between them is missing, as is a “follow up” stating whether you were able to answer them or not! This point is important, and a must-have in a journal paper.
32. The difference between your second and third research questions is unclear. The questions do not need to come one right after the other; you could add a small paragraph between them giving some example or further explanation.
33. “Therefore, a solution on how to make the conversion from XML Schema to ShEx is described in this paper.” --> “In this paper, we describe a solution on how to make the conversion from XML Schema to ShEx.” Avoid the use of passive voice for this type of sentences.
34. “Detailing how each element in” --> “We describe how each element in”
35. In the description of the sections, replace the commas with either semicolons or periods.
36. Why do you refer to a “possible set of mappings” in Section 4? Could you elaborate on that idea?

### Section 2

1. “Conversion to Semantic Web formats is a field that presents several previous works.” --> Rephrase and check the English; conversion is a task, not a field.
2. In general, the related work section still needs a lot of work. The authors do not position their work comparing against the state-of-the-art works. The related work should be better organized by, for example, building a taxonomy of existing approaches. Thus far, it looks very unconnected and loose.
3. “In [1] they try to” --> “In [1], the authors”
4. “XSLT stylesheets which, by” --> “XSLT stylesheets which by” (remove comma)
5. “could retrieve, query, merge and transform data” --> What is the difference between retrieve and query?
6. “Data validation is also a key question [12]” --> A key question for whom and/or for what area?
7. “On one hand Shapes Constraint Language (SHACL)” --> “On the one hand Shapes Constraint Language (SHACL)”

### Section 3

1. “It was one of the inputs for the W3C” --> “It was one of the foundations for the W3C”
2. “Nowadays, version 2.0 was released” --> use "Recently," instead. But it would be better to say something more precise like "In November 2017,"
3. “the working group is developing” --> “the working group is currently developing”

### Section 4

1. “XML Schema datatypes there is a” --> “XML Schema datatypes, there is a”
2. Listing 5 is never referenced
3. “As presented in the previous examples, when” --> “As presented in the examples in Listing 6, when”
4. “Complex types can be compound of different statements” --> “Complex types can be composed of different statements”
5. “transformation of each possibility below.” --> “transformation of each possibility in the following.”
6. Listing 7 is never referenced
7. “The following example shows how the mapping is done” --> “The example in Listing 8 shows how the mapping is done”
8. Listing 8 is never referenced
9. “Therefore, translation is performed as shown in the following snippet:” --> “Therefore, translation is performed as shown in the snippet of Listing 9:”
10. Listing 10 is never referenced
11. Listing 11 is never referenced
12. Listing 12 is never referenced
13. Listing 13 is never referenced
14. In Figure 1, increase the font of the graph for readability. Also, use the proper resource labels for Integer, List, Nil
15. Listing 14 is never referenced
16. Try to push the line in page 8 together with the full listing in page 7
17. “Enumeration restrictions use a base type” --> “Enumeration restriction uses a base type”, since afterwards you refer to Enumeration with “it”.
18. Listing 16 is never referenced
19. Listing 17 is never referenced
20. Listing 18 is never referenced
21. Listing 19 is never referenced
22. Listing 20 is never referenced
23. Listing 21 is never referenced
24. In Section 4.7.8, this mapping is not universal since it is only valid under the assumption of a local scope, i.e. XPath query ".". Please comment on this.
25. “Nowadays, ShEx does not support Unique function” --> “Currently, ShEx does not support Unique function”. Maybe add a footnote with the current version.
26. Listing 23 is never referenced

### Section 5

1. “a prototype has been developed that uses a subset” --> “a prototype has been developed. This prototype uses a subset”
2. “mappings and converts from a given XML Schema input to a ShEx output” --> “mappings and converts a given XML Schema input into a ShEx output”
3. “The example presented in Listing 24” --> “The input XML Schema document example presented in Listing 24”
4. “Therefore, complex types are converted to shapes” --> “Complex types are converted to shapes” Remove the therefore, this is not a deduction.
5. “as it is stated in the example conversion below.” --> In Listing 25?
6. “Therefore, translation of a valid XML to RDF” --> “The translation of a valid XML to RDF”
7. “This is done for avoiding to create a fictional node” --> “This is done in this way to avoid the creation of a fictional node”

### Section 6

1. “Using an existing validator helped to demonstrate that an XML and its corresponding XML Schema are still valid when they are converted to RDF and ShEx, although some ambiguity premises must be satisfied” --> Please explain the last part of the sentence.
2. “plausible using the advanced SHACL-Sparql features.” --> This is the first time that SHACL-SPARQL is mentioned. What are those features?

### References

1. References 4, 7, 10, 13, 16 are missing a source or URL
2. References 18, 21, 29 are missing the venue
3. Please uppercase the abbreviations such as RDF

Review #3
By Simon Steyskal submitted on 14/Dec/2017
Minor Revision
Review Comment:

The paper has been significantly revised and together with the authors' response accompanying the resubmission most of my raised remarks/questions were addressed. However, there are still a couple of open issues that need to be taken care of and again, you can find all my (handwritten) remarks as a scan at [1].

1) phrasing: Try to abstain from pompous statements like "We consider that this work [..] could drive to a new era of semantic-aware and interoperable data." if you don't elaborate on how your work contributes to that "paradigm shift" (i.e., why does data suddenly become "semantic-aware"?). Your work isn't any less justified without having those statements in it.

2) typos: see [1]

3) missing references: while all the listings now have captions and their formatting has significantly improved, a lot of them are still not explicitly referenced in the text. E.g. in 4.3.2 Choice: "Therefore, translation is performed as shown in the following snippet: [Listing 9]" -> could be rephrased to "Therefore, translation is performed as shown in Listing 9:"

4) consistency: Section 1-3 use xsd:, Section 4-5 use xs:, Listing 24 uses xs: in the XML Schema part but xsd: in the ShEx part, Listing 25 uses xsd:, Shaclex uses xsd: -> why not use xsd: throughout the entire paper?

5a) research questions 1-3: while you do address the very first RQ with the mappings provided in Section 4, I don't see how your work addresses RQ 2-3. Please provide a recap on how each RQ is addressed by what parts of your work in the conclusion.

5b) research question 4: "What are the conditions to ensure a valid conversion?" -> despite the fact that this RQ is very vaguely phrased (what conditions? what constitutes a "valid" conversion?), I've some serious issues with the paragraph that (I guess) is supposed to address RQ4 (p. 14):

"This kind of transformations can work in most of
the cases. However, there is a premise—which is in
line with one of the defined research questions—that
must be satisfied before generating a valid conversion.
In case of XML files with ambiguous content models
where some files can be transformed in different
ways and correct validation of converted data cannot
be guaranteed. This problem comes in two dimensions:
from XML to RDF, trying to maintain the same semantics
with different models; and for schema generation,
trying to create a schema that describes all the possibilities.
>>Nevertheless, if this ambiguity problem is previously
solved or is not present, the conversion can be
validated using the proposed techniques.<<"

So basically, you're addressing RQ4 by assuming that any ambiguities have already been resolved or don't exist at all? That's not how it works. Using the same line of argumentation, one could write a paper about P vs NP having a RQ along the lines of "What are the conditions to ensure that P==NP" where said RQ is addressed by saying:

"Every problem whose solution can be verified in polynomial time
can also be solved in polynomial time can work in most of
the cases. However, there is a premise—which is in
line with one of the defined research questions—that
must be satisfied before P==NP holds.
After decades of studying NP-complete problems no one has been able
to find a polynomial-time algorithm for any of more than 3000 important known NP-complete problems.
Nevertheless, if such a polynomial-time algorithm for one of the NP-complete problems was found, we can
safely say that P==NP holds."

[1] https://github.com/simonstey/Reviews/blob/master/journal_swj/review_swj1...