Understanding the Structure of Knowledge Graphs with ABSTAT Profiles

Tracking #: 3082-4296

Blerina Spahiu
Matteo Palmonari
Renzo Arturo Alva Principe
Anisa Rula

Responsible editor: 
Guest Editors Interactive SW 2022

Submission type: 
Full Paper
While there has been a trend in the last decades for publishing large-scale and highly-interconnected Knowledge Graphs (KGs), their users often get overwhelmed by the daunting task of understanding their content as a result of their size and complexity. Data profiling approaches have been proposed to summarize large KGs into concise and meaningful representations, so that they can be better explored, processed, and managed. Profiles based on schema patterns represent each triple in a KG with its schema-level counterpart, thus covering the entire KG with profiles of considerable size. In this paper, we provide empirical evidence that profiles based on schema patterns, if explored with suitable mechanisms, can be useful to help users understand the content of big and complex KGs. We consider the ABSTAT framework, which provides concise pattern-based profiles and comes with faceted interfaces for profile exploration. Using this tool, we present a user study based on query completion tasks, where we demonstrate that users who look at ABSTAT profiles formulate their queries better and faster than users browsing the ontology of the KGs, a fairly strong baseline, considering that many KGs do not even come with a specific ontology that can be explored by the users. To the best of our knowledge, this is the first attempt to investigate the impact of profiling techniques on tasks related to content understanding with a user study.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/Apr/2022
Major Revision
Review Comment:

The authors reported on ABSTAT profiles to facilitate SPARQL query formulation. The ABSTAT approach (and system) is used to help users understand a KG's structure. In their experimental setup, they compared the performance of users who used ABSTAT with that of a group that used WebProtégé (their baseline). Somewhat unsurprisingly, the authors' data indicate that their approach does help users formulate queries.

In terms of originality and relevance, the topic of this article is timely and relevant. Furthermore, the article reports on a rather extensive experiment with a system that was previously presented in other venues.

The results presented by the authors are important, but the authors could improve on two points:

First, it is unfortunate that the authors did not set up an experiment comparing ABSTAT, a baseline, and Loupe. The authors stated that Loupe was the only system that was comparable and available; because of this, comparing both would likely have yielded even more interesting data. The authors did not explain why they did not compare their system to Loupe in the experiment, and this limitation is not mentioned in the text. I understand it is challenging to redo an experiment, and I would not ask the authors to do this. The limitation needs to be addressed, however.

Secondly, the authors should have been more transparent about the experiment. For instance, the authors should have shared the questionnaires and instructions to facilitate the reproduction of the experiment. Ironically, the authors stated that sharing these is one of the contributions of the article, but they are not provided. I tried looking up the survey via the mailing list, but the survey is closed.

Sharing these materials would even allow others to conduct this experiment using different systems (and, in part, address the first point). More details about the experiment would also have allowed me, as a reader, to understand some of the figures better.

For instance:
Fig. 8 contains rounding errors (very likely), or the question was not mandatory. I'm not convinced whether there's a difference between a Google Search for a DBpedia resource and the consultation of the Web page describing that resource. I presume people could have typed "dbpedia mountain" to look for that page. Does that count as a Google search?
The article implies that users were able to submit a query multiple times. Was that part of the survey, or was another form or tool used to assess the queries? Did users get feedback when submitting a query (semantics, syntax, correct/not correct)? Section 4.2 states that users needed to report how many times they attempted the query.
Did the instructions explicitly state that users were allowed to use other tools? Why not ask participants to avoid the use of other tools? The authors risk(ed) gathering skewed results, e.g., one participant using DBpedia's interface vs. one using YASGUI with autocompletion.

I am convinced the authors could improve the article if they provided more detail on the experiment and clarified some of their analyses. The latter will be mentioned in the list below.

Another issue with the article is the bias that the authors introduced. The authors decided not to provide a video for users who wanted to use Protégé, assuming that knowledge of ontologies and SPARQL was enough to use Protégé. One could say the same thing for ABSTAT. I would hope that the survey also enquired about prior knowledge of (Web) Protégé and ABSTAT, but the questionnaire is not available.

Overall, the article is well-written. There are quite a few typos, most of which are recurring. E.g.,
"In this section we..." instead of "In this section, we...".
Some sentences need to be rephrased.
The authors did not correctly conjugate some verbs.
Overuse of "moreover."
The text is dense. I would encourage the use of enumerations where possible. For instance, it allows the reader to go back and forth between points and images more easily.

I found that the authors should have mentioned the limitations earlier. They are now mentioned after the related work. These should have been discussed after Section 4.3, either as a separate section or as Section 4.4. There is also a difference between the lessons learned from this study, which can be part of the conclusions, and the lessons learned from the limitations, which can be part of future work.

How does ABSTAT deal with the following corner case: minimal type patterns where the most specific class can be one of several due to multiple class hierarchies? A paragraph on this could make the notion of "most specific" more concrete and the paper self-contained.
How do the authors define "good abstraction" (page 4 line 27)? Where have the authors reported the evidence?
It seems that the text describing Fig. 3 and the image do not correspond. I presume the list is covering dbo:Film and the numbers mentioned on page 5 are not to be found in Fig 3.
In Section 4.1, the authors analyzed the responses to the participants' self-assessment for Q3. The authors mention SPARQL, data modeling, DBpedia datasets, and ontologies, but Fig. 6 only provides data w.r.t. SPARQL. Why were the others left out? It also seems that the authors did not provide an answer to Q3.
Q1 and Q2 are yes/no questions. It would be interesting to focus on the how and why, however.
At the end of Section 4.2, the authors state that using ABSTAT is easy. The authors did not report on a quantitative usability analysis (SUS, PSSUQ, ...). There is also a difference between the comprehension of the profiles (which participants said could be made easier to understand) and the tool's ease of use. I do not believe that the authors adequately assessed the ease of use. I suggest rephrasing this statement. At best, the data indicate that ABSTAT is easy to use (or easier to use than WebProtégé for these tasks).
Why did the authors let participants choose the tool they wanted to use instead of randomly assigning them to either group? The latter is a best practice.

More detailed comments:
Rephrase the sentence starting on line 41 (2nd column).
The authors consider RDFS to be the "simplest" ontology language. This is rather subjective as others deem schema.org semantics more intuitive (I am referring to the semantics of multiple domain declarations, for instance).
Some footnotes refer to the same URL (e.g., 12 and 13).
Inconsistent use of "she/he", "s/he", and "the user."
The scale of Fig. 1 seems off.
The authors used informal speech, such as "very" in multiple places. E.g., "very big", "very few", and "very positive."
I am not sure that the use of "coherent" (page 9) is correct in this context.
While minor, it could be beneficial to add a totals column in Table 5 and make the link with Table 2.
Rounding errors in Fig. 8?
Provide figures instead of "a lot of users" or "very few".

Review #2
By Evan Patton submitted on 16/Apr/2022
Minor Revision
Review Comment:

The authors introduce a tool called ABSTAT that generates patterns from an ontology and instance level data that cover the usage of types and predicates in the knowledge graph. The tool produces a number of statistics including occurrences of types, predicates and datatypes, frequency of patterns, instances of patterns, and min, average, and max cardinalities. The sources for the tool are linked from the group's website and are hosted on BitBucket.

The authors then present a between-subjects study that compares the performance of participants completing 3 SPARQL queries using ABSTAT versus WebProtege and the DBpedia ontology. Query complexity was specified as a function of the number of triple patterns. Participants using ABSTAT were able to complete the task much faster for more complex queries. Participants expressed that they did not feel they needed to use other tools when using ABSTAT, whereas participants using WebProtege also used other resources such as the DBpedia documentation.

Overall, the results are highly promising and the paper is relevant to the special issue. It is good to see new tools being developed to aid information discovery in linked data and semantic web. That said, I have two reservations regarding the study design that the authors may want to address in the text.

First, participants self-selected into ABSTAT or WebProtege, but instruction on the tool was only given to the participants in the ABSTAT condition. This is dismissed as "we assume that users ... are familiar with SPARQL and ontologies." It has been a while since I have used Protege, but if memory serves, an understanding of SPARQL is not a prerequisite for using it. The study should have included equal amounts of training regardless of the tool chosen to ensure the best possible comparison between the two platforms. In fairness, the fact that some participants in the ABSTAT condition did not view the tutorials but still performed similarly well supports that ABSTAT is an easier tool to adopt.

The second issue is that the first half of the paper is spent comparing ABSTAT to another tool called Loupe, but then the study switches entirely over to comparing against WebProtege. While I can appreciate that WebProtege is more commonly used and therefore serves as a better baseline, the shift is jarring. The reader is left wondering whether the Loupe output was as good as ABSTAT's when employed in a similar task, since the comparison switches from a more apples-to-apples comparison of two summarization tools to a summarization tool versus an ontology editor. The motivation could be better explained, and a follow-up study that includes Loupe as a condition would be informative. The decision to use WebProtege is explained, but it doesn't necessarily justify dropping comparisons to Loupe.

Minor issues:

Page 6 column 2 line 26 has a spelling error of linkedbrianz instead of linkedbrainz. There were also a number of grammatical issues that could be addressed in the final manuscript.

The last paragraph of page 6 column 2 might be better summarized as a table, possibly with an improvement factor given that both ABSTAT and Loupe compression ratios have the same denominator. Possibly add this as another column in Table 1 but that table is already pretty busy. It may be worth splitting it into two tables, one that specifically highlights the compression ratio comparisons and a separate table that has the breakdown of statistics for the ontologies.

In Table 1, having different scales for the ABSTAT vs. Loupe comparisons makes it harder to compare the effects of the two approaches.

The intro paragraph for Section 3 reads as very disjointed. I was able to understand it after a few readings (lots of commas and parentheticals). While I ultimately understood what you're arguing here, some wordsmithing would make it easier for readers to follow in the future.

I found the text in Figs 5, 6 and Table 2 difficult to read. The authors may want to increase the font size for clarity.

I wonder if there is a better term than compression rate for what is accomplished by ABSTAT. Typically when one thinks of compression tools (gzip, etc.) the expectation is that the data will be recoverable in some capacity. Also, wouldn't a pattern (owl:Thing, rdf:type, owl:Class) effectively have a near 100% compression ratio in spite of the fact that it doesn't tell us much--assuming a fairly flat class hierarchy?

I am surprised by the use of a t-test for comparing the ABSTAT and WebProtege completion times. Typically, a t-test is used under an assumption of homogeneity of variance. However, given the disparity in the numbers, I wonder whether a homoscedasticity test was done and whether a non-parametric test like Mann-Whitney U would have been more appropriate here.
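The checks the reviewer suggests can be sketched with SciPy; the completion times below are made-up illustrative numbers, not the paper's actual data:

```python
# Sketch of the reviewer's suggested analysis on fabricated completion
# times (in minutes) -- purely illustrative, not the study's real data.
from scipy import stats

abstat_times = [5.2, 6.1, 4.8, 5.5, 6.0, 5.1, 4.9, 5.7]
protege_times = [9.8, 12.4, 8.1, 15.0, 10.2, 22.5, 9.0, 11.7]

# Levene's test checks homogeneity of variance; a small p-value means
# the equal-variance assumption behind the classic t-test is doubtful.
lev_stat, lev_p = stats.levene(abstat_times, protege_times)

# Welch's t-test (equal_var=False) drops the equal-variance assumption.
t_stat, t_p = stats.ttest_ind(abstat_times, protege_times, equal_var=False)

# Mann-Whitney U is non-parametric: it compares rank distributions and
# needs neither normality nor equal variances.
u_stat, u_p = stats.mannwhitneyu(abstat_times, protege_times,
                                 alternative="two-sided")

print(f"Levene p={lev_p:.3f}, Welch p={t_p:.4f}, Mann-Whitney p={u_p:.4f}")
```

If Levene's test rejects homoscedasticity, reporting Welch's t-test or the Mann-Whitney U result would address the concern above.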

Review #3
Anonymous submitted on 21/Apr/2022
Minor Revision
Review Comment:

This paper presents a web-based, interactive tool for summary navigation of knowledge graphs. The system attempts to provide succinct, specific information on the most prominent relationships in the graph.

Paper strengths:

The ABSTAT approach is not novel to the paper, but its use in this interface seems original and useful. The motivation for making it easier to understand and navigate especially very large knowledge graphs is compelling. Interfaces such as these help both inexperienced and expert users make better use of what is sometimes hidden in the graph.

The description and explanation of the presentation of the system is for the most part very clear. The reasoning behind the design choices is well articulated, and overall I think the system merits interest.

The experiment setup has both strengths and weaknesses, but the strengths are more numerous. The sample size is suitably large, the questionnaire was well-administered, and the statistical analysis presented is insightful.

The weaknesses of the paper are as follows:
1. Three tasks is a relatively small number.
2. It's unfortunate that users only used one system, and no opportunity was presented for side-by-side comparison
3. I am curious about self-selection for the different systems; while the data do not seem to show a bias, is there any indication as to why users chose one or the other?
4. No reporting on performance of the system - how long does it take to display the content?

Overall I think this paper merits publication with a small number of revisions, which I mention below, in order of importance:
0. Please add some information on how fast the page loads, especially compared to Protégé, and whether queries run in real time. This helps in understanding the user experience and the time metric in the evaluation.
1. I think that a discussion of the impact of reasoners and sub-class relationships would be useful. If the graph includes extensive subclasses, or relies on those relationships, what is the impact on the ABSTAT approach?
2. I am slightly puzzled by the assertion at the start of Section 3 that understanding is not well-tested, particularly as the methodology of approaching extrinsic understanding via applied tasks is what is used here too. I think this should be rewritten to explain the relationship between being able to complete the chosen task and understanding of the graph.
3. Compression rate is a sensible and illustrative metric, but I find the notation (especially the ~0.003) a bit counterintuitive; could the reciprocal be better (e.g., a ~330x reduction)?

There are some typos and grammatical errors. While I don't think I caught them all, here are some:

1. P2 L40 a profile(s)
2. P3 L41 functionalities -> features
3. P4 L27 'much' helpful -> delete much
4. P9 L41 'take (in) average' -> delete in
5. P9 L44 describe -> add s
6. P10 L39 what is a 'minor time'?
7. P14 L25 why 'did' Protege user -> add did
8. P14 L27 Why 'was' exploring the ontology in Protégé to answer to the queries (is) not enough? - add was, delete is.