Systematic Performance Analysis of Distributed SPARQL Query Executing Using Spark-SQL

Tracking #: 2455-3669

This paper is currently under review
Mohamed Ragab
Sadiq Eyvazov
Riccardo Tommasini

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
Abstract: 
Recently, a wide range of Web applications has been built on top of vast RDF knowledge bases (e.g., DBpedia, UniProt, and Probase) using the SPARQL query language. The continuous growth of these knowledge bases has led to the investigation of new paradigms and technologies for storing, accessing, and querying RDF data. In practice, modern big data systems like Apache Spark can handle large data repositories. However, their application in the Semantic Web context is still limited. One possible reason is that such frameworks are not tailored to graph data models like RDF. In this paper, we present a systematic evaluation of the performance of the Spark-SQL engine for processing SPARQL queries. We configured the experiments using three relevant RDF relational schemas and two different storage backends, namely Hive and HDFS. In addition, we show the impact of using three different RDF-based partitioning techniques in our relational scenario. Finally, we discuss the results of our experiments, which offer insights into the impact of different configuration combinations.