Extensive Benchmark of Small Language Models for Datatype Properties Extraction and RDF Knowledge Graph Generation

Tracking #: 3845-5059

Authors: 
Célian Ringwald
Fabien Gandon
Catherine Faron
Franck Michel
Hanna Abi Akl

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
The choice made for representing the inputs and outputs of generative pre-trained language models (PLMs) can impact their fine-tuning on a new task. This article focuses on the fine-tuning and linearization process to generate facts extracted from text. On a restricted relation extraction (RE) task, we challenged five encoder-decoder models, namely BART, T5, CodeT5, FlanT5 and PileT5, by fine-tuning them on 13 linearization variations, including RDF standard syntaxes and variations thereof. Our benchmark covers the validity of the produced triples, the models' performance, their training behaviour and the resources needed. We show these PLMs can learn some syntaxes more easily than others, and we identify a promising "Turtle Light" syntax supporting the quick and robust learning of the RE task.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 12/Jul/2025
Suggestion:
Major Revision
Review Comment:

The paper reports a benchmark of a set of Small Language Models (SLMs) on triple extraction from text documents. Specifically, the authors focus on evaluating the impact of triple linearization methods on datatype property relation extraction (i.e., triples with datatype properties, as opposed to object properties). The article is an extension of a paper from the ESWC special track on LLMs for Knowledge Engineering (2024).

Compared to the previous version of the paper, the authors propose the following extensions to their investigation: (a) the introduction of the Turtle Ultra Light syntax, a slight refinement of Turtle Light obtained by removing the ":" character, and (b) the addition of three SLMs (CodeT5, FlanT5 and PileT5) alongside BART and T5. Given that the paper extends the ESWC paper, these additions fulfil the 30% extension requirement for submission to this venue.
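For readers unfamiliar with these variants, the following is a minimal, purely illustrative Python sketch of what such linearizations might look like; the exact definitions are given in the paper, and the example triples, property names and formatting choices below are assumptions, not the authors' specification.

    # Illustrative linearizations of two datatype-property triples about one subject.
    triples = [
        ("dbr:Ada_Lovelace", "dbo:birthDate", "1815-12-10"),
        ("dbr:Ada_Lovelace", "dbo:deathDate", "1852-11-27"),
    ]

    def turtle(triples):
        # Turtle-style: factorised subject with ';' separators (prefix declarations omitted).
        subj = triples[0][0]
        body = " ;\n    ".join(f'{p} "{o}"' for _, p, o in triples)
        return f"{subj} {body} ."

    def turtle_light(triples):
        # Assumed "Turtle Light": Turtle-like, one triple per line, no prefix
        # declarations, no datatype annotations, no terminating '.'.
        return "\n".join(f'{s} {p} "{o}"' for s, p, o in triples)

    def turtle_ultra_light(triples):
        # Assumed "Turtle Ultra Light": Turtle Light with the ':' of prefixed
        # names removed, as described in the review above.
        return "\n".join(
            f'{s.split(":")[-1]} {p.split(":")[-1]} "{o}"' for s, p, o in triples
        )

    for fn in (turtle, turtle_light, turtle_ultra_light):
        print(f"--- {fn.__name__} ---\n{fn(triples)}\n")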

However, in its current state, the article has several drawbacks, including the following:

(-) The reasoning behind the selection of additional SLMs is not convincing. If the goal is to focus on smaller, frugal models, I would question why smaller decoder-only models, e.g., GPT-Neo 125M and OPT-125M, were not included in the selection. Such additional models would broaden the spectrum and make the results more general.

(-) Furthermore, while the authors expand the scope of the investigation to include the aspects mentioned above, the results do not differ significantly from those of the previous paper. This is reflected in the fact that none of the additional SLMs or the added syntax makes it onto the first page of the ranked list (Top-24).

(-) Another drawback relates to the benchmark dataset. In the original paper, the authors focus only on a single class (dbo:Person) and its datatype properties. However, for a journal article, I expect the authors to expand the dataset to include additional class(es) and properties, possibly derived from other sources, e.g., Wikidata, to confirm whether the results hold in a more generic setup.

(-) The article's title and introduction do not reflect the core content of the paper. The title is slightly misleading (an extensive performance evaluation of small language models is not the paper's main focus). Furthermore, the introduction is heavily based on the previous paper and is very condensed and hard to understand (presumably due to space limitations in the special track). Further elaboration on the context, problem description, proposed approach, and key results in the Introduction section would help the reader.

On the positive side, the authors provide all relevant resources in a GitHub repository and on the wandb.ai platform to support further investigation and reproducibility of the experimental results.

Minor comments
- P1 L35: "the construction of massive corpora aligning texts and facts from knowledge graphs (KG) e.g. Wikipedia articles with corresponding Wikidata or DBpedia subgraphs": it is not clear what the authors mean here. Please add references.
- P1 L39: "In this context, extracting from Wikipedia the missing information in KGs is a critical task.": "... extracting the missing information in KGs from Wikipedia is critical."
- P1 L41: "Now that we have end-to-end off-the-shelf methods, ...": which methods are you referring to? Please add references.
- P2 L36: Various RDF serializations are mentioned in a single sentence; please consider formatting them as a bullet list or a table to make them more readable.
- P3 L21: Step 1.1 and Step 1.2 are switched (compared to the figure)

Review #2
By Pablo Calleja submitted on 03/Aug/2025
Suggestion:
Minor Revision
Review Comment:

(1) Originality
The work is original. It presents a detailed evaluation of different pretrained models for the task of RDF generation, particularly focusing on data properties with different syntaxes. The results show that these models exhibit a high variability in performance but also highlight the importance of good training practices, reaching optimal points without the need for large infrastructures or excessive CO₂ consumption.
(2) Significance of the results
The work is significant. Although it builds upon a previous paper, it improves and extends the experiments and evaluation, including metrics such as optimal training point detection and carbon footprint estimation. Moreover, it proposes variations of RDF syntax so that the vocabulary used is more suitable.
The part that could be further improved is the discussion of the results in connection with the paper's conclusions. The authors provide a large number of results, but it is not entirely clear whether the strong performance of BART or T5 is due to the training corpus, the tokenizer, or the architecture, compared with the rest of the T5 family. This section should be expanded.
(3) Quality of writing
The quality of the writing is good. The ideas are presented in an organized manner, and the steps and discussions are clearly and thoroughly detailed.

COMMENTS
1- I do not consider that the title should include "Small Language Models", as this is not entirely accurate. The term "small" is currently used to refer to other, even larger models (e.g., 2 billion parameters). The authors also do not return to this designation later in the paper. They could simply keep "pretrained language models", as in the rest of the paper, and their work would still be equally valid.
There is an error in the running header at the top of each page: it says SML when it should be SLM.
2- The authors place significant emphasis on the RE task, although in this context, since they work with documents whose text is paired with three to several datatype properties, the task could rather be considered Triple Extraction (TE). RE typically involves identifying the entities to be linked and narrowing down the relation types. This could be discussed.
3- Page 1, line 39: “quality issues” — which ones?

4- Why is extracting the missing information from Wikipedia a critical task?
5- Figure 2 is not referenced. This is a general comment for the whole paper: several figures are never referenced even though they contain very good representations and information worth detailing in the text.
6- Page 5, line 29. It is not at all clear why the ordering step is important.
7- Table 2 is not referenced. What is nb_prop?
8- Figure 3 is very good; please reference it in the text.
9- Page 7, line 17. The indices for the different syntaxes are wrong.
10- Page 8: is the Turtle Light syntax shown on one line wrong?
11- Figure 7 is very interesting; however, it is not referenced in the paper. What are the main RDF tokenization problems such that BART could handle all of the syntaxes while the other models could not?
12- Page 11, line 8. 5,000 examples and 250 disjoint examples? Earlier, the dataset D was said to contain 6,000. This just needs to be explained better.
13- In the same section: are there no maximum-length problems once the prompt is included?
14- Page 12, line 11. From this point on, F1 is written with an overline, but it is different from F1+. Please explain the difference.

The error examples on page 15 are very interesting.

As mentioned earlier, it would have been interesting to include a discussion of the results explaining why BART remains a winning model for this task, despite, a priori, not having been exposed in detail to RDF as a primary training source. Is it the type of tokenization that allows greater representativeness? Could a T5 trained from scratch on data including RDF (along with its tokenizer) achieve the same results?
Overall, it is a good piece of work that shows that not everything has yet been solved by LMs and that computational cost and carbon footprint are also important considerations.

Review #3
Anonymous submitted on 07/Aug/2025
Suggestion:
Minor Revision
Review Comment:

Overview

The paper benchmarks small encoder–decoder language models fine-tuned to extract RDF datatype property triples from text. It evaluates how different structured output formats—ranging from standard RDF serializations (e.g., Turtle, JSON-LD) to custom simplified versions—affect model performance, learning efficiency, and output validity. The study focuses on the interplay between output linearization formats and model effectiveness, rather than on the models' knowledge extraction or reasoning capabilities per se.

Strengths

- Originality: The paper tackles a novel aspect of relation extraction – the impact of RDF output format on model fine-tuning. Prior works fine-tuned generative models for triple extraction (e.g. REBEL[1]), but this work is unique in systematically comparing a dozen RDF serializations, including W3C-standard ones, for generating knowledge graph triples.

- Significance of Results: The benchmark yields actionable insights for the community. It identifies that certain linearizations (notably the proposed "Turtle Light") enable high extraction performance with lower training cost, whereas fully verbose syntaxes (JSON-LD, full Turtle) demand more resources. These findings are important for designing future knowledge graph population systems – the choice of output encoding is shown to materially affect both model accuracy and efficiency.

- Experimental Thoroughness: The evaluation is extensive. The authors test five different PLMs across 13 output formats, use rigorous metrics (triple validity, F1, training convergence behavior), and even report training time and CO2 footprint. This multifaceted analysis strengthens confidence in the conclusions. They also perform K-fold validation and analyze variations like subject grouping (factorization) and one-line vs. multi-line format, which adds depth to the study.

- Reproducibility and Data Sharing: The study is supported by a well-organized public repository on GitHub with a detailed README. It includes the data preparation scripts, configuration files for each model/format, and even links to trained models and logs on Weights & Biases. This openness greatly facilitates understanding and replication of the experiments. The use of an established DBpedia dump as the data source and provided SHACL shape ensures the task is clearly defined and repeatable.
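As a concrete illustration of such a shape-based validity check, the following is a minimal sketch using rdflib and pySHACL; the shape and example output below are invented for illustration and are not the shape shipped in the authors' repository.

    from rdflib import Graph
    from pyshacl import validate

    # Hypothetical shape constraining one datatype property of dbo:Person.
    SHAPE_TTL = """
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix dbo: <http://dbpedia.org/ontology/> .
    @prefix ex:  <http://example.org/> .

    ex:PersonShape a sh:NodeShape ;
        sh:targetClass dbo:Person ;
        sh:property [
            sh:path dbo:birthDate ;
            sh:datatype xsd:date ;
            sh:maxCount 1 ;
        ] .
    """

    # A generated output to be checked (well-formed Turtle with a typed literal).
    OUTPUT_TTL = """
    @prefix dbo: <http://dbpedia.org/ontology/> .
    @prefix dbr: <http://dbpedia.org/resource/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    dbr:Ada_Lovelace a dbo:Person ;
        dbo:birthDate "1815-12-10"^^xsd:date .
    """

    data = Graph().parse(data=OUTPUT_TTL, format="turtle")    # fails here if the syntax is invalid
    shapes = Graph().parse(data=SHAPE_TTL, format="turtle")
    conforms, _, report = validate(data, shacl_graph=shapes)
    print(conforms)   # True if the generated triples satisfy the shape
    print(report)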

Weaknesses
- The paper does not explain why only encoder–decoder models were used. Decoder-only transformer models have become dominant in recent years (especially with the rise of GPT-style models), with many open-weight options available for fine-tuning (e.g., LLaMA, Falcon). Their exclusion seems arbitrary and unexplained, especially since decoder-only LLMs have demonstrated strong performance across structured generation tasks.

- While the study focuses on structured output formatting, it does not address how the presence of Wikipedia in pretraining data—given DBpedia’s origin—might have influenced the models' ability to reproduce correct outputs, independent of learning the syntax itself.

- Scope and Generality: The evaluation is restricted to datatype properties of Person entities in DBpedia. This narrow focus limits the generality of the conclusions – it remains unclear if the observed trends hold for object properties or other knowledge domains. The paper itself notes future work will extend to object properties, highlighting that the current study addresses only a specific subset of the KG extraction problem.

- Incremental Contribution: While thorough, the work builds upon existing approaches (e.g. it fine-tunes off-the-shelf models and a REBEL-based pipeline). The concept of linearizing triples for sequence-to-sequence models is not entirely new[15]. The main novelty lies in comparing many syntax variations, which is useful but arguably an empirical extension of prior ideas. Some might view the contribution as more of a “large-scale evaluation” than a fundamentally new method or theory.

- Comparative Baselines: The study could be strengthened by including additional baselines or points of reference. For example, it does not report results for larger decoder-only LLMs (like GPT-3/4) on the task – even if only zero-shot or few-shot – which could indicate the performance gap between small fine-tuned models and large prompted models. Similarly, comparing against a non-generative baseline (e.g. a BERT-based classification or slot-filling approach for the same properties) could help quantify the absolute performance of the proposed approach. The absence of such comparisons leaves the reader to assume the fine-tuned generative models are state-of-the-art, but some context on how close to human or baseline performance they are would add value.

- Writing Quality: Overall the paper is clearly written and structured, but there are minor issues. In a few places, dense technical descriptions and notation (e.g. in the linearization notation or metric definitions) could be explained more intuitively for readability. The term “small language models” is used to describe 220M+ parameter models without explicit clarification – this could be misleading, and a brief explanation that “small” is relative to models like GPT-3 would help. These are minor style quibbles; the manuscript is otherwise well-organized with a good flow from problem setup through results.

Comments

- What is the motivation for choosing only encoder–decoder models? The paper briefly mentions scalability challenges with decoder-only models, but that does not explain excluding them from full fine-tuning -- several decoder-only models have been adapted to extraction tasks. I would recommend discussing this decision in a bit more detail in the paper revision.

- Page 9, lines 26–27: The sentence about Wikipedia pretraining is unclear. It first suggests models weren’t trained on Wikipedia, then says they were. Note: The Pile includes English Wikipedia (Gao et al., 2020).

- The models were likely pretrained on Wikipedia, which overlaps with the DBpedia-based task. Although the study centers on evaluating output formatting rather than factual correctness per se, some performance gains may stem from content memorization rather than syntax learning. A more controlled analysis—e.g., testing on synthetic or out-of-domain entities—could help disentangle formatting generalization from content familiarity.
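A minimal sketch of the kind of out-of-domain control suggested here, assuming synthetic person descriptions with invented names and a single hypothetical datatype property (this is not part of the paper's pipeline):

    import random

    random.seed(0)

    # Invented name fragments, unlikely to correspond to entities seen in pretraining.
    FIRST = ["Zorane", "Quillem", "Natrix", "Velda"]
    LAST = ["Ombrelli", "Farquont", "Ystrad", "Mekkelsen"]

    def synthetic_example():
        # Build one synthetic text/gold-triple pair for a single datatype property.
        name = f"{random.choice(FIRST)} {random.choice(LAST)}"
        birth = f"{random.randint(1700, 1990)}-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}"
        text = f"{name} was born on {birth}."
        gold = [(name.replace(" ", "_"), "birthDate", birth)]
        return text, gold

    text, gold = synthetic_example()
    print(text)
    print(gold)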

- The title “1. Introduction: Targeted Datatype Properties Extraction” seems unconventional. I recommend using simply “1. Introduction” to align with standard practice. The same applies to “2. Related Work”—I suggest sticking to the conventional title without additional text. If needed, the authors can indicate themes within the section using bold or italicized phrases at the start of paragraphs. The same comment applies to other section headers, such as “3. Methodological Framework: Definitions and Notations”—“3. Methodological Framework” is sufficient, especially since the subsections already indicate the specific content. In general, I recommend avoiding colons in main section titles. While occasionally justified—e.g., when providing an expanded form of an acronym—avoiding them improves consistency with common conventions and makes papers easier to parse programmatically. Unusual title formats often lead to misprocessing in downstream tasks.

- Data Repository Evaluation: The GitHub repository (https://github.com/datalogism/12ShadesOfRDFSyntax) is well organized and includes a clear README with syntax examples. All necessary resources for replication—code, configurations, SHACL shape, and model outputs—are provided. GitHub is suitable, though archiving on Zenodo is recommended for long-term accessibility. The artifacts appear complete and support reproducibility.

- Related Work: The related work coverage is adequate; however, to strengthen it further, the authors could cite concurrent research like Text2KGBench [2] or the work by Frey et al. on "How well do LLMs speak Turtle" [3], as these reinforce the timeliness of evaluating LLMs for structured output generation.

References
1. Cabot, P. L. H., & Navigli, R. (2021, November). REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 2370-2381).
2. Mihindukulasooriya, N., Tiwari, S., Enguix, C. F., & Lata, K. (2023, October). Text2KGBench: A benchmark for ontology-driven knowledge graph generation from text. In International Semantic Web Conference (pp. 247-265). Cham: Springer Nature Switzerland.
3. Frey, J., Meyer, L. P., Arndt, N., Brei, F., & Bulert, K. (2023). Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle?