S-Paths: Set-Based Visual Exploration of Linked Data Driven by Semantic Paths

Tracking #: 2195-3408

Marie Destandau
Caroline Appert
Emmanuel Pietriga

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
Meaningful information about an RDF resource can be obtained not only by looking at its properties, but by putting it in the broader context of similar resources. Classic navigation paradigms on the Web of Data that employ a follow-your-nose strategy fail to provide such context, and put strong emphasis on first-level properties, forcing users to drill down in the graph one step at a time. We investigate a navigation strategy based on semantic paths and aggregation. Starting from sets of resources, we follow chains of triples (semantic paths) until we find properties that 1) provide meaningful descriptions of resources in those sets, and 2) are amenable to visual representation, considering a broad range of visualization techniques. We implement this approach in S-Paths, a browsing tool for linked datasets that systematically tries to identify the most relevant view on a given resource set, leaving users free to switch to another resource set, or to get a different perspective on the same set by selecting other semantic paths to visualize.

Major Revision

Solicited Reviews:
Review #1
By Roberto García submitted on 08/Jul/2019
Major Revision
Review Comment:

The paper contributes a very interesting, and novel in many regards, approach to semantic datasets exploration. The main novelty is in considering property paths when exploring sets of resources, not just direct properties, and on a set of predefined visualizations that are automatically configured and selected based on their computed usefulness.

Despite the interesting contribution and the nice results visible in the online version of the tool, my recommendation is a "major revision", mainly because the paper stays too much at the overview level and provides few details about the internals of the contribution.

Regarding this issue, there is little information about how the tool interacts with the SPARQL endpoint. My understanding is that it is done just through SPARQL though the code seems to be linked to just the Virtuoso store. In any case, this makes it difficult for me to understand what SPARQL is generated for some queries.

For instance, consider trying to get the names of all female laureates between 1900 and 1919. I first focus on the first two decades displayed. Then, I change the category dimension to the gender of the laureate and focus just on females. If I then display the names of the laureates, I also get some men. Though not evident from the feedback provided by the tool, this makes sense because I am filtering awards the whole time, not laureates, and there are awards with both male and female laureates. Thus, I should pivot to the set of laureates before filtering by gender. However, I am only offered this option if I use the path to the gender through laureate and then pivot. I managed, at last, to get the desired result, though the experience was a little confusing because SPARQL DISTINCT might be operating at different levels: when displaying the laureates and their Nobel prize years, the second Nobel for Marie Curie disappears.
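
The pitfall described above can be reproduced outside the tool. The following is a minimal Python sketch over toy rows, not the queries S-Paths actually generates; the row layout and the function names are assumptions made for illustration only. It contrasts filtering the set of awards with filtering the set of laureates, and shows how deduplicating per laureate would drop a second prize year.

```python
# Toy rows: (award, year, laureate, gender). Hypothetical layout, not the
# schema of the dataset used by S-Paths.
ROWS = [
    ("physics-1903", 1903, "Henri Becquerel", "male"),
    ("physics-1903", 1903, "Pierre Curie", "male"),
    ("physics-1903", 1903, "Marie Curie", "female"),
    ("chemistry-1911", 1911, "Marie Curie", "female"),
]


def laureates_of_filtered_awards(rows):
    """Keep awards having at least one female laureate, then list all their laureates."""
    kept = {award for award, _, _, gender in rows if gender == "female"}
    return sorted({name for award, _, name, _ in rows if award in kept})


def filtered_laureates(rows):
    """Pivot to laureates first, then filter the laureates by gender."""
    return sorted({name for _, _, name, gender in rows if gender == "female"})


def laureate_years(rows, distinct_on_laureate):
    """List (laureate, year) pairs for women, optionally keeping one pair per laureate."""
    pairs = sorted({(name, year) for _, year, name, gender in rows if gender == "female"})
    if not distinct_on_laureate:
        return pairs
    seen, result = set(), []
    for name, year in pairs:
        if name not in seen:  # a per-laureate DISTINCT keeps only the first year
            seen.add(name)
            result.append((name, year))
    return result
```

Under these assumptions, `laureates_of_filtered_awards(ROWS)` also returns the two men who shared the 1903 prize, whereas `filtered_laureates(ROWS)` returns only Marie Curie.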

Other parts that require more detail concern the processing done to prepare the system when it is deployed on a new dataset, especially how entry-point classes are determined. In this regard, the proposed approach of directly jumping to the selected entry-point class seems too "narrow" and goes against Shneiderman's mantra about starting with an overview. What about using the treemap view to show the main classes in the dataset and letting the user choose where to start from?

Finally, something that also requires more detail, and maybe more effort on the part of the authors, is the evaluation. Right now, though it is well motivated and the questionnaire about what users learned is very appealing, a more rigorous approach is also required, especially regarding the effectiveness and efficiency of the tool, for instance by actually measuring the time it takes users to complete the tasks.

As mentioned, a more rigorous evaluation would require further work so maybe the best option for the current paper would be to extend the technical details about the contribution, as mentioned earlier, and keep the evaluation mainly as future work, with some preliminary results like those from the questionnaires. Then, for future work, it might be interesting to use evaluation results to compare the user experience provided by the tool to similar ones. One option might be the BESDUI benchmark (https://github.com/rhizomik/BESDUI), which in fact is a "cheap" approach to UI benchmarking because it doesn't require real-users involvement.

Review #2
By Agnieszka Lawrynowicz submitted on 04/Nov/2019
Major Revision
Review Comment:

The paper presents an approach and a tool for visual exploration of linked data, which provides visual representations of resource sets that help gain insights about those resources.
The authors gave a good overview of related works, pointing out that the other approaches do not make it possible to navigate linked data directly while benefitting from aggregation techniques, sub-selections, etc.

+ important topic, fitting well in the scope of the Semantic Web Journal
+ good narrative
+ good idea
+ set-oriented exploration, such as identifying correlations, observing distributions, comparing & contrasting groups of resources
+ builds nicely on the experiences and results of previous works in the area
+ minimal effort from users before they can start browsing
+ S-Paths is distributed as an open-source project (requires registering/signing up to the git service of INRIA)

- some descriptions are too generic: precise definitions, algorithms, etc. are lacking
- some information is given only by means of examples and not by an exhaustive list
- the evaluation setup was not always clearly communicated to users

Further comments:


"amenable to visual representation" -> is this measurable?
"set-based navigation"-> it is not precisely defined in the paper what it is

***1. Introduction***
"Most linked data browsers employ a follow-your-nose strategy" -> reference needed
"Properties that provide relevant descriptions of resources are not necessarily direct properties of those resources." -> any (quantitative) evidence? Examples?
"They can be several hops away in the RDF graph, depending on how abstract the dataset’s model is and on what ontologies it employs"-> can this be an artefact of serializing OWL to RDF?
Fig.1 contains very small images, hardly readable.

***3. S-Paths***
"S-Paths is designed to support users in the exploration of linked datasets" -> this sentence would benefit from stating what S-Paths is (even if previously mentioned)

"semantic path" -> It is unclear whether a "semantic path" is being introduced by this paper, or if it has been introduced previously. In the former case, it would be better to provide a phrase like "in this paper we introduce semantic paths". In the latter case, there should be a citation or a section with preliminaries to delineate what is the contribution of the current paper with respect to previous work.

"A semantic path is a set of resources related to a set of values by a sequence of RDF statements." -> this definition is a bit vague, as it does not describe the relation in detail, e.g., are the RDF statements arbitrary? It would benefit from formalization.

"S-Paths provides a collection of such templates" -> where can they be found, or is there a list of them? It would be useful to point to such a list at this place in the manuscript, or to state that it will be described later in the paper in Section X.

"Considering paths that can be indirect, and not only first level properties, mechanically results in aggregation steps to the set of results." -> what does it mean that this *mechanically* results in aggregation steps?

"The full analysis is performed only when S-Paths gets set up with a new set of graphs." -> what is a full analysis? The analysis with respect to all the characteristics?

"S-Paths provides a set of views: map, image gallery, timeline, statistical charts, simple node-link diagrams, etc." -> I recommend removing "etc." and providing the full set of views or a reference to the full set (e.g., included in a table).

"Once semantic paths for a given resource set have been characterized," -> it is unclear exactly how they are characterized; no algorithm has been presented before this point of the paper, only a generic description.

"These are used in multiple places in the interface, e.g., next to the resource selection menu, in the view configuration menu, in the axes’ legends, and whenever a semantic path is displayed." -> again, I would prefer an exact description rather than just examples (introduced by "e.g.").

"They serve as entry points into the data, constituting what we consider a priori to be reasonably-coherent groups of entities." -> what do the authors consider to be "reasonably-coherent groups of entities"? This should be made precise/formalized.

Fig.6 and Fig.7 are much too small in my opinion.

http://s-paths.net is down.

***4. Illustrative Scenario***

The images at Figure 8 are too small, non-readable.

***5. Evaluation***
In general, when users do exploratory search it is often a pre-requisite to solving some task.
The users were given tasks indicatively. At the same time the users were asked to explore the dataset in an open-ended fashion. There is perhaps some inconsistency or lack of explicit instructions on the goals with regard to the guidelines the users were given. Therefore I think that the quantitative results presented in "5.3. Task Success and Task Time" are not valid or reliable.

"They also had to tell whether they would have been able to answer those questions before the experiment."-> I am not sure this is the right setup.
I would expect a setup in which the users are divided into two groups: one group answers the questions without the tool first, and then the groups switch roles on another, similar configuration of the task. This would actually evaluate how well the task is solved with and without the proposed system, rather than asking users how they think they would have solved it. The baseline might be the raw data?

***6. Limitations***

I appreciate that the authors provided the limitations section.

6.2. Data Processing: The authors claim they have tested S-Paths on several datasets, but neither the testing setup nor what exactly was tested is described.

Overall, I am rather positive about the paper, regarding its narrative, motivations, topic and contribution, which is very much in line with the scope of the Semantic Web Journal.
However, I would expect more precise definitions (clear, even if not overly formal), a more rigorous style of writing (at the places I have indicated in my comments), and a fix for the broken link to the demo.

Another major remark is that the paper might perhaps be better submitted in the category "Reports on tools and systems"?

Review #3
Anonymous submitted on 06/Jan/2020
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

# Summary
This work addresses the problem of visual explorations of RDF graphs. In this context, the authors identify two main challenges: (i) identifying meaningful descriptions or properties of resources, which may be several hops away from the pivot nodes, and (ii) pruning irrelevant descriptions from a large pool of RDF triples or path candidates. To address these challenges, the authors present an approach based on semantic paths. The proposed solution automatically selects relevant paths as well as aggregations to visualize the path values. The visualization tool S-Paths was evaluated with a user study comprising a total of eight participants. The results indicate that the tool can be used by experts and lay users.
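
To make the notion of properties lying several hops away concrete, here is a minimal Python sketch of following a chain of predicates from a set of resources. It is one possible reading of a semantic path, not the authors' implementation; the triples, the predicate names, and `follow_path` are hypothetical.

```python
# Toy triples (subject, predicate, object); all identifiers are made up.
TRIPLES = [
    ("award1", "laureate", "person1"),
    ("award2", "laureate", "person2"),
    ("person1", "gender", "female"),
    ("person2", "gender", "male"),
]


def follow_path(triples, start, predicates):
    """Follow a sequence of predicates from a set of resources.

    Returns the set of values reached after the last hop. Each hop replaces
    the current frontier with the objects of matching triples.
    """
    frontier = set(start)
    for predicate in predicates:
        frontier = {
            obj
            for subj, pred, obj in triples
            if pred == predicate and subj in frontier
        }
    return frontier
```

With these toy triples, the gender of a laureate is two hops away from an award: `follow_path(TRIPLES, {"award1", "award2"}, ["laureate", "gender"])` reaches both gender values, while a single hop only reaches the laureates.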

# Originality
The paper presents a novel solution for dynamically selecting meaningful visualizations of sets of resources in RDF graphs. Based on the characterizations of the semantic paths, the tool is able to select the most prominent properties and views to display the results.

# Significance of the results
The empirical evaluation of S-Paths is rather limited and there are some issues with the experimental design and the presentation of the results.

The number of participants in the evaluation is very low, especially considering that the users are classified into three types of users (publisher: 2, reuser: 2, layman: 4). Furthermore, the difficulty of the tasks included in the questionnaire used in the evaluation seems rather low (this is evidenced as one user already knew the answer to 3 out of 5 questions beforehand).

Regarding the results, the authors did not measure the usability of the tool while solving the tasks and rather present a descriptive report about the users’ experience. While these provide some interesting insights into the users’ perspectives, unfortunately, from this study, it is hard to derive significant conclusions about the effectiveness of S-Paths.

# Quality of writing
The paper is, in general, well written and easy to follow. The authors include examples to illustrate the relevant concepts. Still, the presentation could be improved by:

* Including more details in the description of the components of the approach (see Q1).

* Specifying the functionalities that should be supported by the SPARQL endpoint to support the requests from S-Paths.

* Increasing the size of the figures.

# Questions to the authors:
Q1. How does S-Paths handle path values with different representations? For example, path values may contain both URIs and literals (e.g., see dbp:routesOfAdministration in DBpedia), or different syntactic values that represent the same semantic value (e.g., 01-01-1930 vs. 1930).

Q2. Does S-Paths take into consideration that some SPARQL endpoints (e.g., OpenLink Virtuoso) have quotas for the maximum execution time of a query or the maximum number of results returned by a query? Please note that these quotas may affect the values retrieved from the endpoint.

Q3. What is the average response time of S-Paths in the studied cases?

Q4. On page 5 (lines 40 and 47), what do (1.8) and (1.9) mean?

Q5. On page 5 (line 45), what do 1.8 -10 and 1.13-14 mean?

Q6. On page 6 (line 45), what is the definition of a class with the ‘richest description’?

Q7. On page 12 (lines 21-23), the following sentence is unclear: “(...) we relied in the model to generate and store them in a small adjacent graph”. What is the model used in the case where RDF graphs do not have rdf:type statements?
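
As an illustration of the heterogeneity raised in Q1, the sketch below separates URIs from literals and maps the two date spellings mentioned there to a common year. It is only an assumption about what such normalization could look like, not a description of what S-Paths does; `value_kind` and `year_of` are hypothetical helpers, and the day-month-year reading of 01-01-1930 is assumed.

```python
import re


def value_kind(value: str) -> str:
    """Classify a path value as a URI or a plain literal (crude prefix check)."""
    return "uri" if re.match(r"https?://", value) else "literal"


# Patterns tried in order; each captures a four-digit year.
_YEAR_PATTERNS = [
    re.compile(r"(\d{4})"),              # 1930
    re.compile(r"\d{2}-\d{2}-(\d{4})"),  # 01-01-1930
    re.compile(r"(\d{4})-\d{2}-\d{2}"),  # 1930-01-01
]


def year_of(value: str):
    """Return the year encoded in a date-like literal, or None if unrecognized."""
    text = value.strip()
    for pattern in _YEAR_PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            return int(match.group(1))
    return None
```

Values for which `year_of` returns `None` would still need a separate treatment, which is exactly the case the question asks the authors to describe.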