Knowledge Base Quality Assessment Using Temporal Analysis

Tracking #: 1596-2808

Rifat Rashid
Giuseppe Rizzo
Nandana Mihindukulasooriya
Oscar Corcho

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
Knowledge bases are nowadays essential components for any task that requires automation with some degree of intelligence. The quality of such knowledge bases can drastically affect the decisions taken by any algorithm, thus, for instance, affecting the classification of an email or the final policy maker choice. Establishing checks to ensure a high level of quality of the knowledge base content (i.e. data instances, relations, and classes) is of utmost importance. In this paper, we present a novel knowledge base quality assessment approach that relies on temporal analysis. The proposed approach compares consecutive knowledge base releases to compute quality measures that allow detecting quality issues. In particular, we considered four quality characteristics: Persistency, Historical Persistency, Consistency, and Completeness. The approach has been assessed both quantitatively and qualitatively on a series of releases from two knowledge bases, eleven releases of DBpedia and eight releases of 3cixty Nice. In particular, a prototype has been implemented using the R statistical platform. The capability of the Persistency and Consistency characteristics to detect quality issues varies significantly between the two case studies. The Completeness characteristic is extremely effective and was able to achieve 95% precision in error detection. The proposed approach delivered good performances. The measures are based on simple operations that make the solution both flexible and scalable.

Major Revision

Solicited Reviews:
Review #1
By Ruben Taelman submitted on 26/Apr/2017
Major Revision
Review Comment:

This paper describes a quality assessment approach for knowledge bases that evolve over time.
Four different quality characteristics were considered: Persistency, Historical Persistency, Consistency and Completeness.
The approach was implemented in R and evaluated using multiple DBpedia and 3cixty Nice releases.
The quantitative analysis illustrates the quality assessment approach over the four quality characteristics for the two knowledge bases.
The qualitative analysis consists of a manual study to verify the resulting quality assessment.

I definitely see the usefulness of this straightforward approach, especially because temporal quality assessment has not been done before.
I do however have several major concerns with this work (ordered by importance), some of which are acknowledged (ACK) by the authors.

1. The four quality metrics are based on aggregated information, which makes quality reporting less exact than it could be. (ACK)
The authors correctly acknowledge this concern, stating that by only looking at the entity counts,
essential information about raw triple additions and removals is lost.
By only looking at counts, you increase the chance of false positives.
For instance, if the entity count increases, entities may still have been removed, as long as more new entities were added.
This will lead to a false positive for at least the Persistency metric.
The usefulness of the work will increase a lot in my opinion if this is resolved.
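To make this false-positive scenario concrete, here is a minimal sketch. The entity IDs and both function names are purely illustrative, not the paper's implementation; the set-based variant shows the fine-grained comparison that a count-based measure cannot express:

```python
# Instances of a class in two consecutive releases (hypothetical IDs).
release_i = {"e1", "e2", "e3"}
release_j = {"e2", "e3", "e4", "e5"}  # e1 removed, e4 and e5 added

def persistency_from_counts(prev, curr):
    # Count-based measure: 1 if the entity count did not shrink, else 0.
    return 1 if len(curr) >= len(prev) else 0

def persistency_from_sets(prev, curr):
    # Set-based variant: 1 only if no entity from the previous release disappeared.
    return 1 if prev <= curr else 0

print(persistency_from_counts(release_i, release_j))  # 1: the count grew, the removal is hidden
print(persistency_from_sets(release_i, release_j))    # 0: e1 was removed
```

The aggregated measure reports no issue even though an entity disappeared, which is exactly the false positive described above.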

2. Conclusions from the qualitative analysis are questionable; I am not convinced about the statistical significance.
For each quality metric, only one class (with multiple properties) was considered for each KB.
For Persistency, Historical Persistency and Consistency, only half of the KBs resulted in a true positive.
This makes the usefulness of these metrics questionable.
More classes should be considered to make the analysis results statistically significant.
Completeness seems to be the only metric that results in a high level of precision for both KBs.
But again, the statistical significance is questionable as only one class for each KB is considered.
Ideally, all classes for each KB should be considered, but this might not be feasible with the current manual qualitative analysis.
Perhaps different verification methods are possible?
Either the authors should temper their conclusions to reflect the less-significant results,
or expand the analysis to make the results statistically significant.

3. The consistency metric is based on a predicate rarity threshold that should be manually set, and no guidelines for this are provided. (ACK)
The authors chose this value to maximize precision, which depends on manual quality checks.
In reality, no such gold standard will be available; otherwise the quality assessment would not be needed.
The authors provide no guidelines for choosing this threshold, which makes the usefulness of this metric questionable.
If the authors were able to provide some heuristics for choosing a value for this threshold, this metric might be usable.
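One possible heuristic, sketched below with made-up frequency ratios (the property names and the 0.1 factor are only illustrative assumptions, not taken from the paper), would derive the cutoff from the distribution of ratios itself rather than from a gold standard:

```python
import statistics

# Fraction of a class's instances carrying each property (hypothetical values).
ratios = {"name": 0.99, "birthDate": 0.97, "deathDate": 0.60,
          "spouse": 0.55, "bnfId": 0.01}

# Heuristic: treat a property as "rare" when its ratio falls well below the
# median ratio for that class, so the cutoff adapts to how densely
# populated the class is overall.
median = statistics.median(ratios.values())
threshold = 0.1 * median  # the factor 0.1 is itself a tunable choice

rare = sorted(p for p, v in ratios.items() if v < threshold)
print(round(threshold, 3), rare)  # 0.06 ['bnfId']
```

A data-driven rule of this kind would at least remove the dependence on manual quality checks when setting the threshold, even if the scaling factor still has to be validated per KB.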

4. The linking of Persistency and Historical Persistency with existing metrics seems arbitrary.
According to the definition of Credibility and Trustworthiness,
I would expect this metric to have something to do with the level of trust you have on the data based on the authority.
Instead, this metric measures to what degree the knowledge base conforms to the growth assumption.
I do not consider this very related to the Credibility or Trustworthiness characteristics;
perhaps there are better-suited characteristics.
Historical Persistency:
This is linked to Efficiency and Performance, while it measures conformance to the growth assumption over the full history.
I am again not convinced that this has much to do with the degree to which a system can efficiently process the data.

5. Information growth assumption (ACK)
Persistency, Historical Persistency and Completeness are based on the information growth assumption,
which means that knowledge bases are expected to grow.
More concretely, facts are expected to be added, not removed.
While most datasets do indeed mostly grow, removal of facts still occurs.
DBpedia, which occasionally has schema changes across releases, will definitely have many removals of facts, with corresponding additions.
When only looking at the aggregated counts within the metrics, this will indeed be less of an issue,
but when fine-grained set comparisons are performed, this might cause problems when no change information between additions and removals is considered.
This is unlikely to be resolvable, since it is a fundamental assumption of this approach.

Issues 2 and 4 should be relatively easy to resolve.
Issues 1, 3 and 5 are however fundamental to the approach, which makes them more difficult to resolve without a major rework.
Furthermore, these issues make me question the usefulness of this approach for practical use cases.

Overall, the paper is nicely structured and easy to read.
There are however several minor typos and presentation issues, which I will list hereafter:

-Abstract-
1. The part about the 'final policy maker choice' is not clear to me. Who is this policy maker? What choice?
2. When reading the abstract, the difference between Persistency and Historical Persistency is not clear.
Perhaps each metric should be briefly introduced here?
3. Only the performance of the Completeness metric is mentioned, while the other metric performances are only mentioned vaguely.
I would make this more concrete.
4. 'The proposed approach delivered good performances': This is highly subjective and it is not mentioned in terms of what this performance is evaluated.
After reading the paper, I am also not convinced about the approach performing well.

-Introduction-
1. '... can lead to detect and measure quality issues.': I would rephrase this.

-Related Work-
1. 2.1. 'sensible measures': Can these be listed?
2. What are 'user-denied scenarios'?
3. 'functional dependence rules' -> 'functional dependency rules'

-Quality Characteristics and Temporal Analysis-
1. Alignment in Table 1 seems to be wrong for the third row.
2. Difference between the definitions in row 2 and 3 is unclear to me, can this be rephrased?
3. 'a data quality issues is' -> 'data quality issues'
4. 'we can identify ... through data quality measure' -> '... measures'
5. I don't understand the point of this sentence: 'A data quality measure variable to which a value ...'
6. Footnote 3 should be DCAT or DQV?
7. 'even though it should not have been removed' How do you know this? Can more information about this be given?
8. 'Our approach focuses on ... subjects and predicates', I would simplify this explanation to simply classes and properties.
9. Each metric function has conflicting names. Persistency_i(C) and Persistency_i for instance should be called differently, PersistencyClass_i(C) and PersistencyKb_i for example.

-Temporal-based Quality Assessment Approach-
1. 'We also assume that each KB release is...': Can't you just say that it's an RDF KB?
2. 'p includes also the resource class belonging to a pre-defined ontology': What is meant by this? Can this be rephrased?
3. 'We perform statistical operations for each release...': At this point, it is not clear what kind of operations are meant by this.
4. 'For computing the change detection': What does this mean? 'Computing changes' or 'Detecting changes'? This occurs multiple times in the text.
5. 'We used statistical profiler...' -> 'We used the statistical profiler...'
6. An example quality report would be nice.

-Experimental Analysis-
1. The implementation section is difficult to read, can examples be given?
2. It is mentioned that each DBpedia release data is stored in a separate CSV file, while on GitHub, a different file structure is present.
3. There is some overlap between section 4 and 5.2 that could be resolved by restructuring the sections.
4. There are a lot of typos in 5.2, I recommend a proofread.
5. From 5.3.1 on, several URLs with prefixes are introduced.
I would not place the full URLs in the footnotes, but introduce the prefixes somewhere at the start of the paper.
Especially because those prefixes were already used earlier in the paper.
6. Table 3, why not consistently use % or [0, 1]?
7. There is a lot of redundancy in the text of Table 3, this could be compressed a lot.
8. I would consistently use the names Persistency, Historical Persistency, Consistency and Completeness in section 5.3.1.
9. 5.3.1: there is no conclusion for the Historical Persistency results.
10. Why are only two classes chosen for the 3cixty dataset? Is this statistically relevant?
11. I don't fully understand Figures 9 and 10.
Consistency_i(p, C) can either be 0 or 1, but in the figures this has a different range.
The frequency should be normalized, so the max value should be 1, right?
When printed in grayscale, the difference between the two lines is not clear, try choosing different colors and symbols.
12. Why is a different extraction process (Loupe [24]) used for DBpedia than 3cixty?
13. For DBpedia, the completeness measure was only applied to properties with Consistency = 1, why?
14. Table 10: the entries for Persistency and Historical Persistency are the same, is this intended/correct?
15. More information about the manual validations should be included in the paper or on the GitHub repo, because they are not reproducible now.
16. It is claimed that dbo:bnfId should not have been removed.
Is this certain? Were the DBpedia extraction framework developers contacted for this? Maybe this was intended?

1. Conformance is questionable because of concern 4.
2. Automation is questionable because of concern 3.
3. Performance is questionable because of concern 2. The Persistency metrics are not mentioned here at all?
4. It might be useful to mention in 6.2 that DBpedia Live also exists, which provides high-frequency updates.
5. The discussion about the KB growth assumption feels out of place.
It seems to be a preliminary evaluation of future work.
6. The exclusion of the latest release from the linear regression seems odd. This means that the distance from the latest release to the predicted value will tend to be higher.
7. Typos in: 'From this measure we can implies ...'
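The regression concern in item 6 can be illustrated with a toy fit (made-up counts, not the paper's data): for ordinary least squares, the residual at a point held out of the fit is systematically larger than the residual of a fit that includes it.

```python
# Entity counts per release (made-up numbers, roughly linear growth).
counts = [100, 112, 119, 133, 141, 160]
x = list(range(len(counts)))

def fit_line(xs, ys):
    # Ordinary least-squares fit of y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) \
        / sum((xi - mx) ** 2 for xi in xs)
    return a, my - a * mx

last = len(counts) - 1
# Fit excluding the latest release (as in the paper) vs. including it.
a_ex, b_ex = fit_line(x[:-1], counts[:-1])
a_in, b_in = fit_line(x, counts)

resid_ex = abs(counts[last] - (a_ex * last + b_ex))
resid_in = abs(counts[last] - (a_in * last + b_in))
print(round(resid_ex, 2), round(resid_in, 2))  # the held-out residual is larger
```

This is the standard leave-one-out behaviour of least squares, which is why measuring the latest release against a trend fitted without it will exaggerate its distance from the prediction.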

Review #2
Anonymous submitted on 26/May/2017
Major Revision
Review Comment:

Very interesting work, a well-written paper, focusing on a very interesting topic.

However, I have one major point for which detailed clarifications are needed. The whole proposed model, both the measures and the characteristics, is based on counting instances.
Can we draw correct conclusions when, for example, computing the persistency of a class based only on the count measure? What if some instances were added to one resource and some others deleted from another resource, keeping the overall count equal in two consecutive versions? The same comment holds for all characteristics.

A detailed description of the added value of the paper compared to [38] is needed. Please explain clearly your contributions compared to this work.

I think that the phrase "temporal analysis" in the title does not reflect the content of the paper. Time is not an input parameter of your measures.

At page 8, you start talking about subjects and objects. A small intro to RDF is needed.

Experiments: At the beginning of this section, please add a table containing some basic statistics for the *input* datasets: size, number of triples, number of properties.
It seems that this section is quite long, with much overlap. There are some results and then some discussions, but without much new material or conclusions. I think that you can shorten this part, and possibly add a short paragraph containing the lessons learnt. I found the qualitative results interesting.

Review #3
Anonymous submitted on 15/Sep/2017
Major Revision
Review Comment:

The paper presents a framework for the automatic quality assessment of KBs by measuring changes across subsequent releases and projecting them into respective quality metrics, namely persistency, historical persistency, consistency and completeness. Experiments for the validation of the proposed approach include quantitative and qualitative analysis over two KBs: 3cixty Nice, an application-specific ontology that is continuously updated through the addition of new assertions, and DBpedia, which captures encyclopaedic knowledge and is updated in batches, including both schema- and assertion-level changes.

The studied problem is a very interesting one that lies at the core of affording high quality KBs in the Semantic Web and Linked Data domains, while the proposed quality metrics reflect key quality considerations. However, in its current form, the presented framework lacks the depth and insights that would render the approach of immediate benefit to KB or Linked Data curators. Detailed comments follow.

Section 1
The approach and its underlying premises are well-motivated and of high relevance as changes and evolution are inherent to Linked Data. It should be outlined, as clearly as it is shown in the following sections, that the proposed quality metrics serve as indicators of possible issues (mentioning some examples would also be constructive), rather than as pointers of specific issues. Consider revising the use of the term “predicate” or providing a clarification, as before reading onwards it is not clear whether it refers to properties or property assertions. Also, it would be good to mention that the proposed quality metrics draw upon the ISO/IEC 25012 and W3C DQV reference standards as well as the work by Zaveri et al., to give the starting context from the beginning.

Section 2
The related work section is fairly extensive and thorough. To avoid the impression of merely citing works and to assist the reader in grasping and positioning the presented approach within the current state of the art, summary tables at the end of each subsection would be of great value. A further structuring, especially of section 2.2, e.g. by grouping trust-related works versus others, would enhance readability further.

Sections 3-6
A first comment, which applies to the entire manuscript (see also below), is that the use of English needs to be considerably improved. E.g. “Consistency relates to a fact being inconsistent in a KB.” => “Consistency assesses whether inconsistent facts are included in the KB.”. The terminology used throughout the manuscript needs to be coherent and clear, without depending on subsequent contents for inferring what the intended meaning is: subject, predicate and object are clear only when used within a triple context; it should also be clear when referring to properties and when to property assertions. E.g. in the definition of completeness, the terms “resource” and “property” are mentioned, while earlier and later on, the terms “subject/entity” and “predicate” are used, respectively.
It is mentioned that the presented work focuses on “low-level” changes, according to the nomenclature in Papavasilieiu et al. What the distinction between “low-level” and “high-level” refers to should be explicit, and also mentioned in the Introduction, as this is a key characteristic of the presented approach. It is not clear why the number of property assertions whose domain is an instance of a given class is called frequency, while the number of class assertions per class is called count; also, as aforementioned, the term “predicate” is used to refer to property assertions. The conciseness metric by Zaveri et al. is not the same as the proposed consistency one, as the former copes with minimality and avoidance of redundancies, not with contradictions.

As far as the presented quality metrics and their interpretation are concerned, further insights are needed. Consistency is defined in a fairly loose way, merely based on the number of assertions of a given property that are present for the instances of a given class. Though such a metric could be an indicator of possible anomalies, these do not necessarily have to do with semantic inconsistency (e.g. it could simply be the fact that certain properties inherently apply only to subsets of that class’ extension, or that the class should be further specialised). More importantly, such a metric overlooks inconsistencies directly ensuing from the propagation of domain and range constraints, which appear to an extent to be captured by the completeness metric. The qualitative analysis, which one would expect to give insights on the aforementioned, is not sufficiently detailed. For instance, for the inconsistent properties detected for lode:Event, which turned out not to be inconsistent, no comment is provided so as to understand why this happened (i.e. was the threshold too low? was it due to the nature of properties that could only apply to some lode:Event instances? etc.). Likewise for the DBpedia case. The same applies to the other measures (e.g. what do the disappearing instances account for? In the 3cixty lode:Event case, it was due to algorithmic reasons, but what about the DBpedia case?), while once again the loose usage of terminology raises ambiguities, e.g. “We first checked five subjects for manual evaluation.” followed by “For DBpedia we checked a total of 250 entities.” Do the “five subjects” refer to five distinct foaf:Person instances? What do the 250 entities amount to? Moreover, the selected classes/properties seem too few to draw insights and conclusions safely. Last, the provided manual validation results tables do not accurately summarise the observations.

The authors accurately identify the two main factors that may impact the validity of the proposed quality assessment approach, yet, given the repercussions of the two phenomena, it would have been expected to elaborate further on these in the qualitative analysis and discussion.
Last, the choice of the two KBs used for evaluation, which adhere to opposite evolution (continuous versus in batches) and domain (application specific versus generic) paradigms, is very interesting indeed. Yet, some more explicit feedback on the consequent differences and challenges with respect to quality assessment would be useful so to better appreciate the support, and differences in performance, observed by the proposed approach.

The manuscript needs to be carefully proofread; non-exhaustive examples of ambiguities/typos include:
page 4: “functional dependence rules”
page 5: “A data quality measure variable to which a value is assigned as the results of measurement of data quality characteristic”
page 8: “the degree to what extent a certain data quality characteristics”
page 11: “querying only on specifying the time of the release as constraint”
page 14: “based on each releases”; “This component divided into two modules”