SHACL-Guided Small Language Models for RDF Knowledge Graph Population: Semantic Evaluation, Cross-Domain Generalisation, and Long-Tail Mitigation

Submitted by Célian Ringwald on 03/19/2026 - 02:27

Tracking #: 4053-5267

This paper is currently under review

Authors:

Célian Ringwald

Fabien Gandon

Catherine Faron

Franck Michel

Hanna Abi Akl

Responsible editor:

Angelo Salatino

Submission type:

Full Paper

Abstract:

Populating RDF knowledge graphs from natural language is challenging when both datatype and object properties must be extracted under schema constraints. We extend the Kastor framework, a SHACL-guided, pattern-based relation extraction approach, scaling it from a single dbo:Person shape to 16 DBpedia entity classes. We introduce four semantic evaluation metrics for object property errors: URI validity, class-level semantic similarity, relation confusion rate, and hallucination estimation. We release KastorKG, an open resource comprising 19 DBpedia SHACL shapes, 141,096 distilled Wikipedia abstract-graph pairs, and 16 fine-tuned small language models (SLMs, 139M parameters). Despite having >100× fewer parameters than LLM baselines, our SLMs outperform 13B-parameter models on 11 of 16 Text2KGBench classes with zero subject or relation hallucinations, versus 12-16% and 7-9% hallucination rates for LLMs. Cross-domain transfer analysis on WebNLG further shows that example-specific pattern overlap and register similarity are stronger predictors of transferability than property-set overlap alone. Finally, long-tail property distributions are identified as the primary performance bottleneck, and sufficient-exposure sampling (guaranteeing at least 1,000 training examples per property) is shown to substantially improve rare-property recall. All code, models, and datasets are publicly available.
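
The abstract does not say how sufficient-exposure sampling is implemented, so the following is only an illustrative sketch of one way to guarantee a minimum number of training examples per property. It assumes each training example is a (text, set of property names) pair and meets the threshold by oversampling with replacement; the authors may instead draw from a larger pool or use another strategy. The function name `sufficient_exposure_sample` and its parameters are hypothetical and are not taken from the Kastor code base.

```python
import random
from collections import Counter, defaultdict


def sufficient_exposure_sample(examples, min_per_property=1000, seed=0):
    """Oversample until every property occurs in at least `min_per_property` examples.

    `examples` is a list of (text, properties) pairs, where `properties` is a
    set or frozenset of property names. Duplicating existing examples is an
    assumption of this sketch, not a description of the paper's method.
    """
    rng = random.Random(seed)
    sampled = list(examples)
    counts = Counter()
    by_property = defaultdict(list)

    # Index the examples by the properties they contain.
    for example in examples:
        _text, properties = example
        for prop in properties:
            counts[prop] += 1
            by_property[prop].append(example)

    # Handle the rarest properties first; an added example also raises the
    # counts of every other property it carries.
    for prop in sorted(by_property, key=lambda p: counts[p]):
        while counts[prop] < min_per_property:
            extra = rng.choice(by_property[prop])
            sampled.append(extra)
            for p in extra[1]:
                counts[p] += 1

    return sampled, counts


if __name__ == "__main__":
    data = [("text a", frozenset({"dbo:birthDate"}))] * 5
    data += [("text b", frozenset({"dbo:birthDate", "dbo:almaMater"}))]
    sampled, counts = sufficient_exposure_sample(data, min_per_property=4)
    print(len(sampled), dict(counts))
```

Oversampling with replacement repeats the same rare examples, so it raises exposure without adding new linguistic variety; whether that trade-off matches the reported gains in rare-property recall can only be checked against the paper and the released code.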

Full PDF Version:

swj4053.pdf

Tags:

Under Review

Long-term Stable Link to Resources:

https://github.com/datalogism/Kastor
