Linked Data Completeness: A Systematic Literature Review

Tracking #: 2265-3478

Authors: 
Subhi Issa
Onaopepo Adekunle
Fayçal Hamdi
Samira Si-said Cherfi
Michel Dumontier
Amrapali Zaveri

Responsible editor: 
Agnieszka Lawrynowicz

Submission type: 
Survey Article
Abstract: 
The quality of Linked Data is an important indicator of its fitness for use in an application. Several quality dimensions used to assess that quality have been identified, such as accuracy, completeness, timeliness, provenance, and accessibility. While many prior studies offer a landscape view of data quality dimensions, here we focus on a systematic literature review for assessing the completeness of Linked Data. We gather existing approaches from the literature and analyze them qualitatively and quantitatively. In particular, we unify and formalize commonly used terminologies across 52 articles related to the completeness dimension of data quality and provide a comprehensive list of methodologies and metrics used to evaluate the different types of completeness. We identified seven types of completeness, including three types that were not identified in earlier surveys. We also analyzed nine different tools capable of assessing Linked Data completeness. The aim of this Systematic Literature Review is to provide researchers and data curators with a comprehensive and deeper understanding of existing work on completeness and its properties, thereby encouraging further experimentation and the development of new approaches focused on completeness as a data quality dimension of Linked Data.

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Simon Razniewski submitted on 03/Sep/2019
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

This article presents a systematic literature review of linked data completeness. The topic is clearly highly relevant to the semantic web community, as a) completeness is a fundamental aspect of data quality, and b) it is even more challenging on the semantic web due to its reliance on pragmatic data gathering and the open-world assumption. Also, to the best of my knowledge, despite extensive research on that topic, no survey exists. As such I think the article is highly relevant to the community, and it is easy to check off condition (4) above.

Regarding criterion (2), coverage, the authors clearly made a comprehensive effort to gather all relevant references, and I have only minor suggestions. The most relevant concern I have is how comprehensively the cited methods are discussed; I find the discussion in places much too short and abstract to be easily understood (this relates also to aspect (1), the suitability as introductory text). I see that extensive coverage of all 52 articles is infeasible, so my suggestion would be to pick, for each dimension, 2-3 representative and diverse articles and discuss them in more detail, and then treat the others as briefly as now.

A related concern I have is about the classification scheme used by the authors. The distinction between schema completeness, property completeness, etc. is important for the understanding of the topic, and as such needs its own introduction, possibly before the survey methodology (Section 3). I appreciate the current attempt to illustrate each dimension with an example at the point where it is introduced, but currently this is much too fragmented to allow an understanding of the big picture, of how these dimensions relate, and of what distinguishes them.

I think the effort to expand the article as per my suggestions is moderate, and I would judge this somewhere in between a minor and a major revision. To avoid issues with the two-strike rule I vote "major revision".

Detailed comments (with page, column and line number):
1.r.42: "By focusing on completeness, we can explore completeness" - Meaning of sentence unclear
1.r.49: Fix grammar
2.l.15: Introduce abbreviation
2.l.21: "Our objective is to analyze .... methods that assess or improve completeness" - the "improving" term gives me a bit of a headache due to the huge number of latent KG completion papers (TransE, TransH, RESCAL, DistMult, ...). It may be worth mentioning these explicitly in the exclusion criteria (I think it is reasonable to say that while they are about internally interlinked data, i.e. KGs, they are not truly about semantic web data).
2.l.33: Add [1]
2.l.49: Replace "identifying" with "assessing"
2.l.50: I don't understand what "however" contrasts to
2.r.4: Why subjective? Although it takes some effort, consumer needs can be formalized (see e.g. [2]), and the nature of measures is that they should be objective, or am I missing something?
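To make this concrete: once a set of consumer queries Q and demand weights w(q) (e.g. derived from query logs) are fixed, a demand-weighted completeness measure is perfectly objective. A rough sketch of the idea (my own notation, not necessarily the exact formulation of [2]): C_demand(D) = Σ_{q ∈ Q} w(q) · compl(D, q), where compl(D, q) ∈ [0, 1] is the fraction of expected answers to q that D actually contains.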
2.r.6: "It is" - "It can be"?
2.r.19-2.r.40: This needs a comprehensive example, as suggested above, probably its own section
2.r.38: Currency is a bit surprising here as it is commonly seen as another dimension - maybe it is motivated later.
2.r.40-44: Something went wrong here - the open-world assumption is at the core of the semantic web, and if one assumes its counterpart, the closed-world assumption, then there is no question of completeness, since under the closed-world assumption, by definition, anything not in the data is false (for settings in between the two, see the partial closed-world assumption).
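To spell out the semantics (my own summary, not from the paper): under the CWA, D ⊭ φ is taken to imply ¬φ, so any dataset is trivially "complete" with respect to itself and the question of completeness disappears; under the OWA, D ⊭ φ leaves the truth value of φ unknown, which is exactly what makes completeness assessment non-trivial; under the partial closed-world assumption, ¬φ may only be inferred when φ falls under an explicit completeness statement Compl(Q), i.e. selected parts of the data are declared closed while the rest remains open.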
3.l.42: "such as." - is what follows a sample, or a complete listing?
3.r.26: "ensure complete coverage" - that's a bit bold and conflicts with the thresholding that comes next. More realistic would be "to maximize coverage".
3.r.31: "the majority of results are not interesting after the first 200" - weak argument: even among the first 200 the majority might not be interesting. A real argument would be the ratio of relevant results within versus after the first 200.
3.r.34: "the most related 200 articles" - how was relatedness determined?
Figure 1: Steps 4-6 seem to be missing in the text? Also, the figure needs numbers next to each step, showing how many articles survived each step, from 6583 down to 52.
4.l.51: Based on the quantitative definition employed in 2.l.43, I wonder whether it would be important to also have "coverage" and "recall" as alternatives to the boolean term "completeness".
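Concretely, with respect to an ideal reference dataset G, this would amount to something like completeness(D) = |D ∩ G| / |G|, i.e. the recall of D against G - a graded quantity rather than a boolean property (my sketch of the obvious formulation, not the paper's).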
4.l: Odd spacing
5.l.37: "over represented"
5.r.34: Style: "A dataset is considered *schema*-complete, if ...", or "the schema of a dataset is considered complete, if ..."
5.r.39: "the" -> "a"
5.r.44: As per the above definition, the use case needs to be spelled out too. Also, the formulation does not bring out well enough that what is meant is that "capital" does not exist in the schema, not merely that a value is missing for the instance "Paris".

!!5.r.50ff exemplifies an important point from above: The discussion of the actual work is here much too shallow - "Several articles address the challenge of schema completeness [52, 57, 60]." The following sentences are a bit more detailed, but still too abstract to be helpful for novice readers. Also the separation into subheadings "problems" and "approaches and metrics" seems not to work well - most of "problems" seems to actually talk about approaches (exemplified by verbs like "developed", "investigated", "built").
I don't have a silver-bullet proposal on how to organize this better, but a way might be to describe 2-3 exemplary and diverse problems and approaches first, then mention the other works in shorter form as now. The discussion should also refer back to the running example.
6.l.35: "The common research challenge is to develop models, metrics or tools" - that covers about everything that is ever possible... Again, a much more concrete exemplification is needed first.

Figure 4: Add abbreviations. Also, CEUR-WS is not a venue/event, but just a hosting portal for many different venues/events.
8.l.51: "property completeness is the ratio of the number of properties to ..." - that sounds like schema completeness? Shouldn't this be about instances?
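To make the distinction explicit (my own rough formulations, not the paper's): schema completeness would be C_schema = |S ∩ S*| / |S*|, the fraction of classes and properties of the ideal schema S* that appear in the dataset's schema S, whereas property completeness should be instance-level, e.g. C_prop(p) = |{i ∈ I_p : ∃v. (i, p, v) ∈ D}| / |I_p|, the fraction of instances that should carry property p and actually have a value for it.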
9.r.48: Style. It's not the instance "France" that suffers from a population completeness problem, but the dataset
10.l.1: Fix grammar
10.l.6: Same here as before: the idea of separating problems from approaches and metrics is ambitious, but does not seem to work.
10.l.44: "Refers to percentage of entities that are interlinked in a dataset" - Odd definition: if each entity has one link in the dataset, but in reality has many more, this definition would not capture it. Is this intended to refer to disambiguation w.r.t. a reference dataset?
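To make the concern concrete, the entity-ratio reading of that definition could be computed roughly as follows (a minimal sketch assuming rdflib, owl:sameAs as the interlinking predicate, and a hypothetical input file; not the authors' metric):

from rdflib import Graph
from rdflib.namespace import OWL

g = Graph()
g.parse("dataset.ttl")  # hypothetical input file

# all subject entities in the dataset
entities = set(g.subjects())
# entities that carry at least one owl:sameAs link
linked = {s for s, _, _ in g.triples((None, OWL.sameAs, None))}

# entity-ratio interlinking completeness: a single link per entity already yields 1.0
ic = len(linked & entities) / len(entities) if entities else 0.0
print(f"interlinking completeness (entity ratio): {ic:.2f}")

A link-ratio variant (links present divided by links expected with respect to a reference dataset) would not saturate in this way.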
10.l.45: Example makes unclear reference to "other dataset"
12.l.10: What are non-information resources?
12: Which of these tools are online?
13.r.39: "We identified 23 metrics" - I seem to have missed that? Where are they described?
13.r.44: fix grammar
13.r.46: "only six tools are automatic" - why is being semi-automatic bad?
Table 4: Very good effort! But needs discussion in text.
14.l.42: Completeness is relevant for completeness? Cyclic argument?
14.l.50: Fix grammar
14.r.46: Can't follow this, see earlier comment on OWA and CWA
15.l.51: "tag their articles with type of completeness" - This seems unrealistic to me. Also, I'm not sure a strict separation into categories is sensible, as more often than not, reality turns out to be messier than X cleanly-defined dimensions of completeness...
15.r.28: Please critically check statements even when they come from a seemingly credible source - the GDP of the USA is only about 20 trillion USD.
- References: 73-78 seem to be errors?
- Suggested additional references: [3,4,5]

[1] Paulheim, Heiko. "Knowledge graph refinement: A survey of approaches and evaluation methods." Semantic Web 8.3 (2017): 489-508.
[2] Hopkinson, Andrew, et al. "Demand-Weighted Completeness Prediction for a Knowledge Base." NAACL 2018.
[3] Luggen, Michael, et al. "Non-Parametric Class Completeness Estimators for Collaborative Knowledge Graphs—The Case of Wikidata." ISWC 2019.
[4] Galárraga, Luis, et al. "Predicting completeness in knowledge bases." Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.
[5] Soulet, Arnaud, et al. "Representativeness of Knowledge Bases with the Generalized Benford’s Law." International Semantic Web Conference. Springer, Cham, 2018.

Review #2
Anonymous submitted on 15/Oct/2019
Suggestion:
Reject
Review Comment:

This paper is a systematic literature review on Linked Data completeness. Seven types of completeness were identified: four previously known ones (schema, property, population, interlinking) and three that were not previously identified (currency, metadata and labelling).

delta with the previous journal survey on Linked Data quality
I am not sure whether a dedicated survey on completeness is required, or whether it would make more sense to have a revisited version of the original survey on Linked Data quality. Namely, I am not confident that the delta is significant and goes beyond the original Linked Data quality survey. In the end, the papers from the literature that contribute to these new types constitute less than 10% of the total number of identified papers, while the majority of the literature is repeated.

lack of analysis
The paper claims that it gathers existing approaches from the literature and analyses them qualitatively and quantitatively.

However, in most cases certain works are said to contribute to a certain type of completeness, but it is not clarified how. For instance,
“[48] investigate internationalisation of knowledge bases and proposed that all entities in a knowledge base have human readable labels. They also explored development of a metric for labelling completeness”, but neither the approach nor the actual metrics are described. This gives me the feeling that this is more a summarisation paper than a survey paper. There are no comments or insights about the state of the art that is presented.

I checked the call of the Semantic Web Journal to find out what is expected from a survey article:
> full-length papers surveying the state of the art of topics central to the journal's scope. These submissions will be reviewed along the following dimensions:
(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
(2) How comprehensive and how balanced is the presentation and coverage.
(3) Readability and clarity of the presentation.
(4) Importance of the covered material to the broader Semantic Web community.

(2) and (3) are well covered, but (1) and (4) might be questioned. The paper lacks the level of detail that I would expect in order to consider it an introductory text. The covered material is indeed important but, the way it is presented, I am not sure the paper offers further insights to the Semantic Web community.

I also looked at what ACM states for its survey journals (being more specialised in surveys); they mention:
> A survey article assumes a general knowledge of the area; it emphasizes the classification of the existing literature, developing a perspective on the area, and evaluating trends.
From that, I only see the first part (classification of the existing literature) actually explored in this survey paper; neither a perspective on the area nor an evaluation of trends is presented.

one-dimension problem
> “by focusing on an individual dimension, we can explore certain aspects of detecting and mitigating the completeness data quality issue in a more thorough fashion”
It is not clear what the authors rely on to assume this. Do you think completeness is a one-dimensional problem? Don’t we miss context if we focus on only one dimension? In the discussion section, it is actually mentioned that it is rare to find research efforts focused entirely on a single data quality dimension. Is it then safe to assume that looking into a single dimension is actually profitable?

The types of completeness are mentioned, but what each one is, is not explained. The new types are not introduced beyond barely being mentioned; both old and new types are only clarified in Section 3.2. However, all types of completeness are already used in Section 3.1 to classify the papers, without explaining how it was decided to which category each article belongs.

definitions
In a survey paper, definitions are introduced, but I would expect such definitions to rest on conclusions derived from the study of the literature, with the definitions built on top of those conclusions (certain aspects may remain open, since the literature might not be complete). The definitions in this survey paper seem ad hoc, introduced by the authors, and there is no evidence that they align with the state of the art.

scope the keywords better
I think the keywords could be scoped better. That might have helped to come up with a more diverse set of papers. For instance, why not “Knowledge Graph” or “RDF”, e.g. “RDF dataset” or “RDF graph”? Linked Data is a knowledge graph that is most commonly represented in RDF, and a lot of quality assessment methodologies were performed relying on the properties of RDF without generalising their conclusions to the broader notion of Linked Data. I think significant papers might be missing from the list.

incorrect interpretation of the figure
> “Figure 4 shows where researchers are publishing their work, where the top journal is Journal of Web Semantics and the top conference for publishing being European Semantic Web Conference."
But Figure 4 shows that most papers are published at the Semantic Web Journal, not the Journal of Web Semantics; the Journal of Web Semantics does not appear in the figure at all. Moreover, ESWC is not the top conference: according to the figure, ESWC has as many papers as ICWE.

conferences vs journals
> researchers are now publishing more in conferences with 29 articles (54%) published at a conference even though no one conference seems to be over represented"
What is "now"? Do you mean 2019? And more compared to when? Also, earlier a top conference was identified, which conflicts with the current statement that no conference is overrepresented. Could this be interpreted differently, for instance that the domain is not mature enough?

identified papers
Each time a subsection on approaches and metrics starts, it is mentioned how many papers talk about this aspect of completeness. I would suggest citing them each time, so the text is self-standing and one does not have to rely on the table to find this information (the accompanying table is meant to accompany; it cannot substitute for the actual text). In the same context, it would be nice to also have the total number of papers for each aspect as a last row of the table. Of course, these are style refinements; the most important question is why the reader should care about those numbers.

identified vs mentioned papers
Even more importantly, those numbers typically do not align with the papers that are cited. For instance, in the case of interlinking completeness, it is mentioned that 11 articles were identified, but only 6 articles are cited within the paragraph. The same holds for the other types of completeness. Why are only a few of the articles cited? What about the rest?

problems vs approaches
I am not sure whether the problems vs. approaches & metrics paragraphs contain anything different. For instance, in the problems paragraph of interlinking completeness, the approach followed by [15] is presented, but not the problems. Then why is [15] mentioned under problems and not under approaches? And this is only one example; it happens in most cases.

no concrete problems nor metrics mentioned
In most cases, I see neither problems nor metrics explicitly mentioned. For instance, in the case of population completeness, it is mentioned that 14 articles were identified, but not one metric is given. As far as the approaches are concerned, it is typically mentioned that there is _a_ framework, but it is not explained which framework or what the idea is.
For instance, “[15, 16, 20, 41, 44, 52, 55, 57] all proposed novel models, methodology and/or metrics for measuring property completeness as part of general data quality assessment.” What did they propose? In the end this is not specified.
There is a table (Table 4) at the end that summarises certain information, but it is not aligned with the text, and the table is not referenced in the text.

metrics
In the discussion section the paper concludes that 23 metrics were identified; however, those 23 metrics were not clearly indicated earlier.

relatedness to the survey
In certain other cases, it is never explained how a mentioned article is related to the survey. For instance, [30] focuses on disambiguation and resolving entity coreference, but it is not explained how this becomes relevant to population completeness.

mismatch of metrics per type & per tool
Certain metrics and approaches are mentioned under each tool, but these are not mentioned earlier, even though the completeness types and the tools rely on the same set of referenced papers.

discussion
The overview of the discussion section reads more like a statistical analysis of the tools than an actual discussion of the observations, while the paragraphs on maintenance and streamlining future surveys read more like a manual (not even guidelines) than an actual discussion.

I debated a lot about my suggested decision. The material is very interesting, but considering all the above comments, I don't think that a major revision would be enough; more radical changes need to happen before the paper is in shape to be considered for revision. I would therefore suggest that the paper be rejected and resubmitted.

Minors

What does the abbreviation “SLR” stand for? It is first used at page 2, line 15, but it is only introduced at line 27. Introduce the full term before the abbreviation and use the abbreviation later on. If the abbreviation is not going to be reused, then why is it introduced?

Not sure why [7] is cited at page 2, line 28

According to the exclusion criteria, papers which are not peer reviewed should not be included. However, the paper by Knap and Michelfeit on “Linked Data Aggregation Algorithm: Increasing Completeness and Consistency of Data” is not peer reviewed but is still included in the list.

[53] —> the reference is wrong

It is not clear what Fig. 2 tries to show. I think the visualisation type used is not adequate for the message it is meant to communicate.

Typo: “In [46], the authored gathered” —> the authors gathered? (A couple more typos exist)

It is not explicitly mentioned in the text what type of completeness is addressed by SLINT+.

I am not sure whether the figure with the tools comparison is more descriptive than a corresponding table would be.

“This type is related especially to Linked Data.” —> are the other types not related to Linked Data? I thought the paper was about Linked Data completeness.

On property completeness: if a property does have a value, but the value is invalid, is it considered among the complete properties?

Table 4 is included in the paper but not linked in the text

Review #3
Anonymous submitted on 19/Dec/2019
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

(1). The paper reads really well, explaining all the necessary concepts on the go. It is concise and to the point. I believe the paper would benefit from a clear discussion of why "Linked Data completeness" is not a contradiction in terms. The authors acknowledge in Section 1.1 that one cannot assume the OWA, but a further discussion would be welcome; in particular, why LOD completeness is different from database completeness and warrants a separate review.

(2). The applied methodology is sound and the considered set of papers counts 52 positions. There is a clearly described algorithm for how the papers were selected. My only problem here is that the authors acknowledge in Section 3.2.1 that "schema completeness" is also called "ontology completeness", which suggests that some papers may have been missed because this term was not included in the search. Could you please discuss that?

(3). Overall the paper reads well, but there are some details that I would like to see improved:
* Rotated text in Figure 1 is a pain to read.
* Figure 2 is pretty, but in my opinion useless. I recommend adding reference numbers inside the circles representing papers; this way one can immediately read which papers are concerned with a particular type of completeness.
* Table 4 is very far from Section 3.2, where it is mentioned.
* Figure 3 is missing years 2006 and 2019. Even if there were 0 papers from these years, please include them on the chart for completeness.
* Figure 4 is not very interesting. The chart itself is very simple, and the names of venues are typeset in a small, hard-to-read font. I think a list or a table would be much more readable.
Also, "CEUR Workshop Proceedings" is not a conference or a journal. In my opinion, papers published there should be labeled with the workshops where they were presented.
* Section 3.2.2, page 8: "measurement-theoretic metrics". This probably makes sense in the context of the original paper, but here it sounds strange.
* The example in Section 3.2.3 says that "the instance France suffers from a population completeness problem", but the definition and the introductory text earlier in this section claim that population completeness is a quality of a dataset, not of a particular instance. If I understand correctly, a dataset suffers from a population completeness problem if it does not contain all the French cities (see the sketch after this list).
* Table 3: I recommend sorting it according to the completeness types instead of according to reference numbers, to display clusters of papers dealing with schema completeness, property completeness, etc.
* Section 3.2.4: My intuition and the example both claim that interlinking completeness is a quality of (at least) two datasets, whereas the definition talks about only a single dataset. I think this should be clarified.
* Section 3.2.5: "For instance, energy consumed or temperature in France is not the same in the last week or month" I find this sentence quite hard to follow and convoluted. Perhaps it could be simplified?
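Regarding the population completeness point above, a rough sketch of my reading (my own notation, not taken from the paper): PC(D, C) = |{e ∈ W_C : e is represented in D}| / |W_C|, where W_C is the set of real-world entities of type C (e.g. all French cities) - i.e. a property of the dataset D with respect to a reference population, not of any single instance.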

(4) I find the topic interesting and important for the community. This is also underlined by Figure 3 of the paper demonstrating that the topic of LOD completeness is on the rise.