The novel scalable parallel denoising for Chinese online encyclopedia knowledge base based on the semantic distance of entry tags and Spark cluster

Tracking #: 2516-3730

This paper is currently under review
Ting Wang
Jie Li
Jiale Guo

Responsible editor: 
Guilin Qi

Submission type: 
Full Paper
Because online encyclopedias are edited through open collaboration, a large number of knowledge triples in them are improperly classified, so open-domain encyclopedia knowledge bases (KBs) must be denoised and refined to improve their quality and precision. However, missing and inaccurate semantic features of triples lead to poor refinement results. Moreover, for large-scale encyclopedia KBs, processing massive amounts of knowledge leads to excessive computing time and poor algorithm scalability. To address these problems of knowledge denoising in Chinese encyclopedia systems, this paper first proposes, based on data field theory, a new Cartesian product mapping-based method for quantifying the quality of entry tags, on which the semantic quantification of the encyclopedia KB is built. Second, it proposes a new multi-feature fusion method for calculating the semantic distance between "out-of-vocabulary" entry tags and embeds this distance into the potential function, further improving the potential function and the denoising effect on KBs. Third, to make the algorithm scalable, the proposed denoising algorithm is implemented and optimized in parallel on the Spark cluster computing framework. Finally, its denoising effect and time efficiency are compared comprehensively with those of the state-of-the-art distributed Chinese encyclopedia knowledge denoising algorithm. Experimental results on real-world datasets show that the proposed parallel denoising algorithm improves both the efficiency of knowledge denoising and the accuracy of KBs, and outperforms state-of-the-art methods.
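The abstract does not give the potential function itself, so the following is only a minimal sketch of how a data-field potential might be used to score and filter triples. It assumes the Gaussian-like form commonly used in data field theory, potential = sum of m_i * exp(-(d_i / sigma)^2), where m_i stands in for a tag quality value and d_i for a semantic distance. The names `potential`, `denoise`, `sigma`, and `threshold`, and the tuple layout of each triple, are hypothetical and not taken from the paper.

```python
import math


def potential(masses, distances, sigma=1.0):
    """Assumed data-field potential: sum of m_i * exp(-(d_i / sigma)^2).

    masses    -- hypothetical tag quality values (the "mass" of each tag)
    distances -- hypothetical semantic distances to each tag
    sigma     -- influence factor controlling how fast the field decays
    """
    return sum(m * math.exp(-((d / sigma) ** 2)) for m, d in zip(masses, distances))


def denoise(triples, threshold, sigma=1.0):
    """Keep the ids of triples whose potential reaches the threshold.

    Each item is (triple_id, masses, distances); this layout is an
    illustrative assumption, not the paper's data model.
    """
    return [
        triple_id
        for triple_id, masses, distances in triples
        if potential(masses, distances, sigma) >= threshold
    ]


# Toy data: "t1" sits close to a high-quality tag, "t2" is far from its tag.
sample = [
    ("t1", [1.0], [0.0]),
    ("t2", [1.0], [3.0]),
]
kept = denoise(sample, threshold=0.5)
```

Because each triple is scored independently in this sketch, the same function could in principle be applied per record with a filter transformation on a Spark cluster. How the paper actually partitions and optimizes the computation is not stated in the abstract.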