An efficient SPARQL engine for unindexed binary-formatted IFC building models

Tracking #: 1705-2917

This paper is currently under review
Thomas Krijnen
Jakob Beetz

Responsible editor: 
Guest Editors ST Built Environment 2017

Submission type: 
Full Paper
To date, widely implemented and full-featured query languages for building models in their native formats do not exist. While interesting proposals have been formulated to query Industry Foundation Classes (IFC) models, their functionality is often not complete and their semantics not defined precisely. With the introduction of the ifcOWL ontology as an equivalent to the IFC schema defined in the EXPRESS modelling language, a representation of native architectural and engineering building models in RDF is provided and such models can be queried using SPARQL. The requirements stemming from the size of data sets handled in complex building projects however, make the use of clear-text encoded RDF infeasible in many use cases. The IFC serialization in RDF is not as succinct as the STEP Physical File Format (IFC-SPF), in which IFC building models are predominantly encoded. This introduces a relative overhead in query processing and file size. In this paper we propose a SPARQL implementation, compatible with ifcOWL, directly on top of a standardized binary serialization format for IFC building models, which is a direct binary equivalent of IFC-SPF, with less overhead than the graph serialization in RDF. A prototypical implementation of the query engine is provided in the Python programming language. This binary serialization format, presented in earlier work of the authors, is based on an open file format for organizing large amounts of data, Hierarchical Data Format (HDF5). It has several properties suitable for querying. Due to the hierarchical partitioning and fixed-length records, known entity instances can be retrieved in constant time, rather than logarithmic time in a sorted or indexed dataset. Statistics, such as the prevalence of instances of a certain type, can be derived in constant time from the dataset metadata. With instances partitioned by their type, querying typically only operates on a small subset of the data. We compare the processing times for eight queries on five building models. The Apache Jena ARQ query engine (using N-triples, Turtle, TDB and HDT), RDF-3X and the system proposed in this paper are compared. We show that in several realistic use cases the interpreted Python code performs equivalent or better to the state of the art implementations, including optimized C++ executables. In other cases, due to the linear nature of the unindexed storage format, query times fall behind, but do not exceed several seconds, and as such, are still orders of magnitude better than the time to parse N-triples and Turtle files. Due to the absence of indexes, the proposed binary IFC format can be updated without overhead. For large models the proposed storage format results in files that are 2-3 times smaller than the currently most concise alternative.
Full PDF Version: 
Under Review