Path based and triplification approaches to mapping data into RDF: user behaviours and recommendations

Tracking #: 3496-4710

Paul Warren
Paul Mulholland
Enrico Daga
Luigi Asprino

Responsible editor: 
Armin Haller

Submission type: 
Full Paper
Mapping complex structured data to RDF requires a clear understanding of the data, but also a clear understanding of the paradigm used by the mapping tool. We illustrate this with an empirical study comparing two different mapping paradigms from the perspective of usability, in particular from the perspective of user errors. One paradigm uses path descriptions, e.g. JSONPath or XPath, to access data elements; the other uses a default triplification which can be queried, e.g. with SPARQL. As an example of the former, the study used YARRRML, to map from CSV, JSON and XML to RDF. As an example of the latter, the study used an extension of SPARQL, SPARQL Anything, to query the same data and CONSTRUCT a set of triples. Our study was a qualitative one, based on observing the kinds of errors made by participants using the two paradigms with identical mapping tasks, and using a grounded approach to categorize these errors. Whilst there are difficulties common to the two paradigms, there are also difficulties specific to each paradigm. For each paradigm, we present recommendations which help ensure that the mapping code is consistent with the data and the desired RDF. We propose future developments to reduce the difficulty users experience with YARRRML and SPARQL Anything. We also make some general recommendations about the future development of mapping tools and techniques. Finally, we propose some research questions for future investigation.
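
The contrast between the two paradigms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample data, the `select` path walker and the `triplify` function are hypothetical and are not taken from the study. The sketch does not use YARRRML, JSONPath or SPARQL, and it does not reproduce SPARQL Anything's actual triplification.

```python
import json

# Hypothetical sample data; not taken from the study.
DATA = json.loads(
    '{"artists": [{"name": "Ana", "albums": ["A1", "A2"]},'
    ' {"name": "Bo", "albums": ["B1"]}]}'
)


def select(node, path):
    """Path-based access: follow keys in order; '*' iterates over every list item."""
    nodes = [node]
    for step in path:
        if step == "*":
            nodes = [item for n in nodes for item in n]
        else:
            nodes = [n[step] for n in nodes]
    return nodes


def triplify(node, subject="root"):
    """Default triplification (assumed scheme): one triple per edge of the tree.

    Dictionary keys and 1-based list positions become predicates; nested
    containers get a synthetic subject name derived from their position.
    """
    triples = []
    items = node.items() if isinstance(node, dict) else enumerate(node, start=1)
    for key, value in items:
        if isinstance(value, (dict, list)):
            child = f"{subject}/{key}"
            triples.append((subject, key, child))
            triples.extend(triplify(value, child))
        else:
            triples.append((subject, key, value))
    return triples


# Paradigm 1: a path expression reaches the artist names directly.
names_by_path = select(DATA, ["artists", "*", "name"])

# Paradigm 2: the same names are found by matching a pattern over generic triples.
TRIPLES = triplify(DATA)
names_by_pattern = [o for (s, p, o) in TRIPLES if p == "name"]
```

In the first case the user must describe where the data sits; in the second the user must understand the shape of the generated triples before writing a pattern against them.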
Minor Revision

Solicited Reviews:
Review #1
By Sergio Rodriguez Mendez submitted on 23/Aug/2023
Minor Revision
Review Comment:

* Summary:
The manuscript presents an empirical user behaviour study that compares two different RDF mapping paradigms: (1) path-based (focusing on YARRRML) and (2) triplification (focusing on SPARQL Anything).
The paper presented an analysis of the common difficulties found in the study, along with recommendations and proposed future improvements to reduce the difficulty users experience.

* Overall Evaluation (ranging from 0-10):
[Q]+ Quality: 10
[R]+ Importance/Relevance: 6
[I]+ Impact: 8
[N]+ Novelty: 8
[S]+ Stability: 9
[U]+ Usefulness: 9
[W]+ Clarity, illustration, and readability: 10
[P]+ Impression score: 8

* Dimensions for research contributions (ranging from 0-10):
(1) Originality (QRN): 8
(2) Significance of the results (ISU): 8.7
(3) Quality of writing (QWP): 9.3

* Overall Impression (1,2,3): 86.67
* Suggested Decision: [Minor Revision]

* General comments:
- I suggest the following updates to footnote 14 (page 14):
| + First sentence: change "sub-elements and a literal" to "sub-elements and textual content (literal)". This description is more technically sound from the XML perspective (as you did in footnote 15).
| + Add a sentence with the following content (or similar): "In general, SPARQL Anything treats XML CDATA sections as literals, and XML comments and processing instructions are ignored."
- pag#17 / Sec.#7.2: "In the mapping lowerHierarchyMapping (Figure 10, left-hand column)" # Figure 11
- pag#26 / Sec.#12 / At the end: the possibility of incorporating a query language for JSON within SPARQL. And how about XQuery for XML?

* Feedback:
Thank you for addressing the comments and highly improving the manuscript's readability.
Additionally, with the paper's restructuring and new sections, the details presented are deeply compelling and enjoyable to read.

* Minor corrections:
- For all the figures (JSON, XML, RDF, SPARQL code), a fixed-width font family such as "Courier New" would be preferable. It would improve the manuscript's presentation.
- pag#17 / Sec.#7.1 / Last paragraph: "iterator were, in fact,implied by [25]" # space after comma
- table#2 cuts the reading flow of the two-column format for that page (#18). It's a formatting issue.
- pag#18 / Sec.#7.5 / 1st paragraph: "artistMapping" ——> should be in italics.
- table#4 / last cell: "Data where ability to explore the data..." ——> "In use cases where the ability to explore data..."
- pag#21 / Sec.#9: "which avoids the imtermediate stage" # intermediate

Review #2
By Ehsan Emamirad submitted on 29/Aug/2023
Review Comment:

I thank the authors for the revision and additional efforts made in the resubmission.
I am comfortable with the resubmission considering the changes made to the paper significantly uplift the quality of the findings along with the technical narrative presented. Particularly in section 9, the differences between the two approaches are clearly identified and discussed in depth, bringing the value of the user behavior study into the light.
The comparative analysis is satisfactory in establishing the pros and cons of each method in reference to the particular use case. The polishes made to the paper also increase the readability of the paper, ensuring that a clear discussion about the methodology is present and the text easily corresponds to the figures.

Review #3
Anonymous submitted on 12/Oct/2023
Review Comment:

This paper presents a study of approaches for mapping data into RDF. I know how difficult it is to perform such a study with users and how much time it requires. However, in my opinion the current version of the paper cannot be accepted in the Semantic Web Journal. In particular, the novelty and the contributions are not clear to me, and the paper seems to be more of a technical report. Moreover, the presentation needs several improvements in most parts of the paper, and the structure should be reorganized. For these reasons (explained in more detail below), my decision for the current version of this paper is “reject”.

Strong points

S1. The study can be of primary importance for the users of YARRRML and SPARQL Anything.
S2. A real user study, which needs a lot of preparation and is time consuming to run; user behaviors are described and several recommendations are given.
S3. There is a Long-term Stable Link to Resources, which contains all the required details and files of the study.

Weak points

W1. The novelty and contributions are not clear.
W2. The presentation and the writing of the paper need to be improved.
W3. The paper is too dense; it contains so many details that it seems to be more of a technical report than a full research paper.
W4. The participants are few; a better option would have been for at least the same participants to do all the tasks for both systems.
W5. The structure of the paper needs to be revised: several pieces of information are repeated, and many sections give too many details.

Abstract and Introduction

First, please use the present tense instead of the past (e.g., the study uses, not the study used). Concerning the abstract, please also mention at the beginning why it is important to map such unstructured data to RDF. Please be clear in the abstract and describe (in brief) the problem, the motivation and the contributions.
Regarding the introduction, do the same, i.e., provide more details on the motivation and on why it is important to map such sources. There is no need to give so many details about the different techniques (YARRRML and SPARQL Anything) in the introduction. Instead, you can add a subsection (e.g., in Section 2) giving a background for each of these techniques.

Moreover, the paper states no research questions (only ones for future work), and the contribution and the novelty are currently not clear. In my opinion it is also not acceptable to say “we believe it to represent the state-of-the-art”; if there is a reference stating that, you can add it.

Related Work

Here, add a background for the techniques that you use in the study. Moreover, add at the end of the section a subsection comparing your work with the related approaches, to highlight the novelty of the presented work.

Section 3

This section needs to be fully revised. I appreciate how difficult and time consuming it is to perform such a study with real users. However, please use figures and tables to represent all the statistics about the users; the current version is quite difficult for the reader. Or at least use some bullets.

Section 4

Similarly to Section 3, again there is too much information in the text. One solution is to extend Table 2 by adding more information for each question, e.g., to include the objective and a link to the corresponding figure.

Sections 5 and 6.

Again the text is too dense. It is helpful that you have figures with the solutions; however, in my opinion the best way is to include comments on some lines of the figures instead of writing so many details in the text. For example, in Section 5, from “Both Mappings … changing the effect”, you can describe all these details by adding comments on the right side of the code of Figure 9.
The same holds for the other figures.

Sections 7 and 8

It is nice that you analyze the errors, but again the sections should be revised to include only the required information, since most of the information included in those sections is also described in Tables 2 and 3. So in the next version of the paper, please provide a shorter version of the text and keep the Tables.

Section 9
The two different paradigms should have been compared earlier in this paper, e.g., in a background section (e.g., before related work).

Section 10.

In my opinion the same users should have done both parts of the study (for both YARRRML and SPARQL Anything). Moreover, 9 participants is a small number although the same repeated problems can occur. Table 5 should be moved to section 3.

Section 11.

Concerning the recommendations, in my opinion they can be helpful for users of the studied techniques. However, again the text is too dense; you could just provide a table showing the key recommendations and future developments.

Section 12.

The conclusion is quite long and repeats material from the previous sections. Please be more precise: just add 2-3 paragraphs with the most important parts of the paper and the final conclusions. Moreover, I do not agree with having the research questions for the future study in the conclusion section. One solution could be to have a subsection in Section 11, stating that performing the study revealed further research questions for the future.

Other Minor issues

have tag item → have tag items
Uses of SPARQL Anything. → Users of SPARQL Anything