Review Comment:
This paper describes a quality assessment approach for knowledge bases that evolve over time.
Four different quality characteristics were considered: Persistency, Historical Persistency, Consistency and Completeness.
The approach was implemented in R and evaluated using multiple DBpedia and 3cixty Nice releases.
The quantitative analysis illustrates the quality assessment approach for the four quality characteristics on the two knowledge bases.
The qualitative analysis consists of a manual study to verify the resulting quality assessment.
I definitely see the usefulness of this straightforward approach, especially because temporal quality assessment has not been done before.
I do however have several major concerns with this work (ordered by importance), some of which are acknowledged (ACK) by the authors.
1. The four quality metrics are based on aggregated information, which makes quality reporting less exact than it could be. (ACK)
The authors correctly acknowledge this concern, stating that by only looking at the entity counts,
essential information about raw triple additions and removals is lost.
By only looking at counts, the chance of false positives increases.
For instance, if the entity count increases, entities may still have been removed, as long as more new entities were added than removed.
This will lead to a false positive for at least the Persistency metric (see the sketch below).
In my opinion, resolving this would greatly increase the usefulness of the work.
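To make the false-positive risk concrete, here is a minimal sketch in Python (with made-up entity sets, not the authors' data or implementation) contrasting a count-based check with a set-based one:

    # Hypothetical entity sets for one class in two consecutive releases.
    release_prev = {"e1", "e2", "e3", "e4"}
    release_curr = {"e1", "e2", "e5", "e6", "e7"}

    # Count-based check (as I understand the Persistency metric):
    # the count grew, so no issue is flagged.
    count_based_ok = len(release_curr) >= len(release_prev)   # True -> Persistency = 1

    # Set-based check on the raw data: two entities were actually removed.
    removed = release_prev - release_curr                      # {"e3", "e4"}
    set_based_ok = len(removed) == 0                           # False -> an issue exists

    print(count_based_ok, set_based_ok, removed)

A set-based comparison of this kind would catch removals that the aggregated counts hide.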
2. Conclusions from the qualitative analysis are questionable; I am not convinced of their statistical significance.
For each quality metric, only one class (with multiple properties) was considered for each KB.
For Persistency, Historical Persistency and Consistency, only half of the KBs resulted in a true positive.
This makes the usefulness of these metrics questionable.
More classes should be considered to make the analysis results statistically significant.
Completeness seems to be the only metric that results in a high level of precision for both KBs.
But again, the statistical significance is questionable, as only one class for each KB is considered (see the sketch below).
Ideally, all classes for each KB should be considered, but this might not be feasible with the current manual qualitative analysis.
Perhaps different verification methods are possible?
Either the authors should weaken the conclusions to correctly reflect these less significant results,
or expand the analysis to make the results statistically significant.
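To illustrate how little can be concluded from so few verified cases, here is a rough sketch (hypothetical counts, not the authors' numbers) of a 95% confidence interval around a precision estimate:

    from math import sqrt

    def wilson_interval(successes, n, z=1.96):
        """95% Wilson score interval for a proportion."""
        p = successes / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return centre - margin, centre + margin

    # e.g. 1 true positive out of 2 manually verified cases (one class per KB, two KBs):
    print(wilson_interval(1, 2))   # roughly (0.09, 0.91) -- far too wide to conclude anything

Even if both cases had been true positives, the interval would remain very wide, which is why more classes (or another verification method) are needed.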
3. The Consistency metric is based on a predicate rarity threshold that must be set manually, and no guidelines for choosing it are provided. (ACK)
The authors chose this value to maximize precision, which in turn depends on manual quality checks.
In practice, no such gold standard will be available; otherwise the quality assessment would not be needed.
The authors provide no guidelines for choosing this threshold, which makes its practical usefulness questionable.
If the authors could provide some heuristics for choosing a value for this threshold (one possible direction is sketched below), this metric might be usable.
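If it helps, one possible heuristic (my suggestion, not something from the paper) would be to derive the threshold from the distribution of normalized predicate frequencies itself, e.g. a low percentile cut-off:

    import numpy as np

    def rarity_threshold(predicate_counts, entity_count, percentile=5):
        """predicate_counts: predicate -> number of entities of the class using it."""
        freqs = np.array(list(predicate_counts.values())) / entity_count
        return np.percentile(freqs, percentile)

    # Hypothetical example with 1000 entities of a class:
    counts = {"dbo:name": 980, "dbo:birthDate": 900, "dbo:bnfId": 3}
    threshold = rarity_threshold(counts, entity_count=1000)
    rare = [p for p, c in counts.items() if c / 1000 < threshold]
    print(threshold, rare)   # flags dbo:bnfId as a rare predicate

Whether a percentile-based cut-off works well in practice would of course need to be validated, but guidance of this kind is what I am missing.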
4. The linking of Persistency and Historical Persistency to existing quality characteristics seems arbitrary.
Persistency:
According to the definition of Credibility and Trustworthiness,
I would expect this metric to relate to the level of trust one has in the data, based on its authority.
Instead, this metric measures to what degree the knowledge base conforms to the growth assumption.
I do not consider this very related to the Credibility or Trustworthiness characteristics,
perhaps there are better-suited characteristics.
Historical Persistency:
This is linked to Efficiency and Performance, whereas it measures conformance to the growth assumption over the full history.
I am again not convinced that this has much to do with the degree to which a system can efficiently process the data.
5. Information growth assumption (ACK)
Persistency, Historical Persistency and Completeness are based on the information growth assumption,
which means that knowledge bases are expected to grow.
More concretely, facts are expected to be added, and not removed.
While most datasets do indeed mostly grow, removal of facts still occurs regularly.
DBpedia, which occasionally has schema changes across releases, will definitely have many removals of facts with corresponding additions.
When only looking at the aggregated counts within the metrics, this will indeed be less of an issue,
but when fine-grained set comparisons are performed, this may cause problems if no information relating additions to removals (i.e. change information) is taken into account.
This is unlikely to be resolvable, since it is a fundamental assumption of this approach.
Issues 2 and 4 should be relatively easy to resolve.
Issues 1, 3 and 5 are however fundamental to the approach, which makes them more difficult to resolve without a major rework.
Furthermore, these issues make me question the usefulness of this approach for practical use cases.
Overall, the paper is nicely structured and easy to read.
There are however several minor typos and presentation issues, which I will list hereafter:
-Abstract-
1. The part about the 'final policy maker choice' is not clear to me. Who is this policy maker? What choice?
2. When reading the abstract, the difference between Persistency and Historical Persistency is not clear.
Perhaps each metric should be briefly introduced here?
3. Only the performance of the Completeness metric is mentioned, while the other metric performances are only mentioned vaguely.
I would make this more concrete.
4. 'The proposed approach delivered good performances': This is highly subjective, and it is not stated against what this performance is evaluated.
After reading the paper, I am also not convinced about the approach performing well.
-Introduction-
1. '... can lead to detect and measure quality issues.': I would rephrase this.
-Related Work-
1. 2.1. 'sensible measures': Can these be listed?
2. What are 'user-denied scenarios'?
3. 'functional dependence rules' -> 'functional dependency rules'
-Quality Characteristics and Temporal Analysis-
1. Alignment in Table 1 seems to be wrong for the third row.
2. The difference between the definitions in rows 2 and 3 is unclear to me; can this be rephrased?
3. 'a data quality issues is' -> 'data quality issues'
4. 'we can identify ... through data quality measure' -> '... measures'
5. I don't understand the point of this sentence: 'A data quality measure variable to which a value ...'
6. Should footnote 3 refer to DCAT or DQV?
7. 'even though it should not have been removed' How do you know this? Can more information about this be given?
8. 'Our approach focuses on ... subjects and predicates', I would simplify this explanation to simply classes and properties.
9. Each metric function has conflicting names. Persistency_i(C) and Persistency_i for instance should be called differently, PersistencyClass_i(C) and PersistencyKb_i for example.
-Temporal-based Quality Assessment Approach-
1. 'We also assume that each KB release is...': Can't you just say that it's an RDF KB?
2. 'p includes also the resource class belonging to a pre-defined ontology': What is meant by this? Can this be rephrased?
3. 'We perform statistical operations for each release...': At this point, it is not clear what kind of operations are meant by this.
4. 'For computing the change detection': What does this mean? 'Computing changes' or 'Detecting changes'? This occurs multiple times in the text.
5. 'We used statistical profiler...' -> 'We used the statistical profiler...'
6. An example quality report would be nice.
-Experimental Analysis-
1. The implementation section is difficult to read, can examples be given?
2. It is mentioned that each DBpedia release data is stored in a separate CSV file, while on GitHub, a different file structure is present.
3. There is some overlap between section 4 and 5.2 that could be resolved by restructuring the sections.
4. There are a lot of typos in 5.2, I recommend a proofread.
5. From 5.3.1 on, several URLs with prefixes are introduced.
I would not place the full URLs in footnotes, but introduce the prefixes somewhere at the start of the paper,
especially because those prefixes were already used earlier in the paper.
6. Table 3: why not consistently use % or [0, 1]?
7. There is a lot of redundancy in the text of Table 3; this could be compressed considerably.
8. I would consistently use the names Persistency, Historical Persistency, Consistency and Completeness in section 5.3.1.
9. 5.3.1: there is no conclusion for the Historical Persistency results.
10. Why are only two classes chosen for the 3cixty dataset? Is this statistically relevant?
11. I don't fully understand Figures 9 and 10.
Consistency_i(p, C) can either be 0 or 1, but in the figures it has a different range.
The frequency should be normalized, so the maximum value should be 1, right?
When printed in grayscale, the difference between the two lines is not clear; consider different colors and symbols.
12. Why is a different extraction process (Loupe [24]) used for DBpedia than 3cixty?
13. For DBpedia, the completeness measure was only applied to properties with Consistency = 1, why?
14. Table 10: the entries for Persistency and Historical Persistency are the same, is this intended/correct?
15. More information about the manual validations should be included in the paper or in the GitHub repo, because they are currently not reproducible.
16. It is claimed that dbo:bnfId should not have been removed.
Is this certain? Were the DBpedia extraction framework developers contacted for this? Maybe this was intended?
-Discussion-
1. Conformance is questionable because of concern 4.
2. Automation is questionable because of concern 3.
3. Performance is questionable because of concern 2. The Persistency metrics are not mentioned here at all?
4. It might be useful to mention in 6.2 that DBpedia Live also exists, which provides high-frequency updates.
5. The discussion about the KB growth assumption feels out of place.
It seems to be a preliminary evaluation of future work.
6. The exclusion of the latest release in the linear regression seems odd. It means the distance between the actual and predicted value for that release will tend to be larger, since the prediction becomes an extrapolation (see the sketch after this list).
7. Typo in: 'From this measure we can implies ...'
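To illustrate comment 6 above, here is a small sketch (synthetic counts, not the authors' releases) of why excluding the latest release turns the prediction into an extrapolation:

    import numpy as np

    rng = np.random.default_rng(0)
    releases = np.arange(1, 9)                                    # release index 1..8
    counts = 1000 + 50 * releases + rng.normal(0, 20, len(releases))

    # Fit excluding the latest release, then predict it (out-of-sample extrapolation).
    slope, intercept = np.polyfit(releases[:-1], counts[:-1], 1)
    dist_excl = abs(counts[-1] - (slope * releases[-1] + intercept))

    # Fit including all releases for comparison (in-sample fit).
    slope_all, intercept_all = np.polyfit(releases, counts, 1)
    dist_incl = abs(counts[-1] - (slope_all * releases[-1] + intercept_all))

    print(dist_excl, dist_incl)   # the extrapolation distance will on average be larger

If the exclusion is intentional, the rationale should be explained in the paper.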