A Quality Assessment Approach for Evolving Knowledge Bases

Tracking #: 1795-3008

This paper is currently under review
Rifat Rashid
Marco Torchiano
Giuseppe Rizzo
Nandana Mihindukulasooriya
Oscar Corcho

Responsible editor: Guest Editors Benchmarking Linked Data 2017

Submission type: Full Paper
Knowledge bases are nowadays essential components for any task that requires automation with some degree of intelligence. Assessing the quality of a Knowledge Base (KB) is a complex task, as it often means measuring the quality of structured information, ontologies and vocabularies, and queryable endpoints. Popular knowledge bases such as DBpedia, YAGO2, and Wikidata have chosen the RDF data model to represent their data because of its capabilities for semantically rich knowledge representation. Despite its advantages, using the RDF data model raises challenges, for example in data quality assessment and validation. In this paper, we present a novel knowledge base quality assessment approach that relies on evolution analysis. The proposed approach uses data profiling on consecutive knowledge base releases to compute quality measures that allow quality issues to be detected. In particular, we propose four quality characteristics: Persistency, Historical Persistency, Consistency, and Completeness. The Persistency and Historical Persistency measures concern the degree of change and the lifespan of any entity type. The Consistency and Completeness measures identify properties with incomplete information and contradictory facts. The approach has been assessed both quantitatively and qualitatively on a series of releases from two knowledge bases: eleven releases of DBpedia and eight releases of 3cixty. A prototype tool has been implemented using the R statistical platform. The capability of the Persistency and Consistency characteristics to detect quality issues varies significantly between the two case studies. The Persistency measure gives observational results for evolving KBs and is highly effective for KBs with periodic updates, such as the 3cixty KB. The Completeness characteristic is extremely effective and achieved 95% precision in error detection for both use cases. The measures are based on simple statistical operations, which makes the solution both flexible and scalable.