Constructing, Enriching and Querying Knowledge Graphs in Natural Language

Tracking #: 3313-4527

This paper is currently under review

Authors: 
Catherine Kosten
Ursin Brunner
Diego Calvanese
Philippe Cudre-Mauroux
Davide Lanti
Alessandro Mosca
Kurt Stockinger

Responsible editor: 
Guest Editors Interactive SW 2022

Submission type: 
Full Paper
As Knowledge Graphs (KGs) gain traction in both industry and the public sector, more and more legacy databases are accessed through a KG-based layer. Querying such layers requires the mastery of intricate declarative languages such as SPARQL, prompting the need for simpler interfaces, e.g., in natural language (NL). However, translating NL questions into SPARQL and executing the resulting queries on top of a KG-based access layer is impractical for two reasons: (i) automatically generating correct SPARQL queries from NL is difficult as training data is typically scarce and (ii) executing the resulting queries through a simplistic KG layer automatically derived from an underlying relational schema yields poor results. To solve both issues, we introduce ValueNet4Sparql, an end-to-end NL-to-SPARQL system capable of generating high-quality SPARQL queries from NL questions using a transformer-based neural network architecture. ValueNet4Sparql can re-use neural models that were trained on SQL databases and therefore does not require any additional NL/SPARQL-pairs as training data. In addition, our system is able to reconstruct rich schema information in the KG from its relational counterpart using a workload-based analysis, and to faithfully translate complex operations (such as joins or aggregates) from NL to SPARQL. We apply our approach for reconstructing schema information in the KG on the well-known data set Spider and show that it considerably improves the accuracy of the NL-to-SPARQL results---by up to 36% (for a total accuracy of 94%) - compared to a standard baseline. Finally, we also evaluate ValueNet4Sparql on the well known lcquad data set and achieve an F1-score of 85%, which outperforms the state-of-the-art system by 17%.