SHACL-Guided Small Language Models for RDF Knowledge Graph Population: Semantic Evaluation, Cross-Domain Generalisation, and Long-Tail Mitigation

Tracking #: 4053-5267

This paper is currently under review
Authors: 
Célian Ringwald
Fabien Gandon
Catherine Faron
Franck Michel
Hanna Abi Akl

Responsible editor: 
Angelo Salatino

Submission type: 
Full Paper
Abstract: 
Populating RDF knowledge graphs from natural language is challenging when both datatype and object properties must be extracted under schema constraints. We extend the Kastor framework, a SHACL-guided, pattern-based relation extraction approach, scaling it from a single dbo:Person shape to 16 DBpedia entity classes. We introduce four semantic evaluation metrics for object property errors: URI validity, class-level semantic similarity, relation confusion rate, and hallucination estimation. We release KastorKG, an open resource comprising 19 DBpedia SHACL shapes, 141,096 distilled Wikipedia abstract–graph pairs, and 16 fine-tuned small language models (SLMs, 139M parameters). Despite having >100× fewer parameters than LLM baselines, our SLMs outperform 13B-parameter models on 11 of 16 Text2KGBench classes with zero subject or relation hallucinations, versus hallucination rates of 12–16% and 7–9% for LLMs. Cross-domain transfer analysis on WebNLG further shows that example-specific pattern overlap and register similarity are stronger predictors of transferability than property-set overlap alone. Finally, we identify long-tail property distributions as the primary performance bottleneck and show that sufficient-exposure sampling, which guarantees at least 1,000 training examples per property, substantially improves rare-property recall. All code, models, and datasets are publicly available.
Tags: 
Under Review