Review Comment:
Overview
The paper benchmarks small encoder–decoder language models fine-tuned to extract RDF datatype property triples from text. It evaluates how different structured output formats—ranging from standard RDF serializations (e.g., Turtle, JSON-LD) to custom simplified versions—affect model performance, learning efficiency, and output validity. The study focuses on the interplay between output linearization formats and model effectiveness, rather than on the models' knowledge extraction or reasoning capabilities per se.
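To make concrete what is meant by an output linearization format: the same datatype-property triples can be rendered with very different verbosity depending on the target syntax. The snippet below is my own illustration, not material from the paper; it uses rdflib (assumed version 6+, which serializes JSON-LD natively) and two example triples about a DBpedia-style Person resource to contrast standard Turtle with JSON-LD, i.e., the kind of variation whose effect on fine-tuning the benchmark quantifies.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

# Example triples in the DBpedia namespaces (illustrative only, not from the paper's dataset).
DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
g.bind("dbr", DBR)
g.bind("dbo", DBO)
person = DBR["Ada_Lovelace"]
g.add((person, DBO.birthDate, Literal("1815-12-10", datatype=XSD.date)))
g.add((person, DBO.birthName, Literal("Augusta Ada Byron", lang="en")))

# The same two triples, linearized once as Turtle and once as JSON-LD.
print(g.serialize(format="turtle"))
print(g.serialize(format="json-ld"))
```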
Strengths
- Originality: The paper tackles a novel aspect of relation extraction – the impact of the RDF output format on model fine-tuning. Prior work has fine-tuned generative models for triple extraction (e.g., REBEL [1]), but this work is unique in systematically comparing a dozen RDF serializations, including W3C-standard ones, for generating knowledge graph triples.
- Significance of Results: The benchmark yields actionable insights for the community. It identifies that certain linearizations (notably the proposed "Turtle Light") enable high extraction performance with lower training cost, whereas fully verbose syntaxes (JSON-LD, full Turtle) demand more resources. These findings are important for designing future knowledge graph population systems – the choice of output encoding is shown to materially affect both model accuracy and efficiency.
- Experimental Thoroughness: The evaluation is extensive. The authors test five different PLMs across 13 output formats, use rigorous metrics (triple validity, F1, training convergence behavior), and even report training time and CO2 footprint; a minimal sketch of how such validity and F1 checks can be computed is given after this list. This multifaceted analysis strengthens confidence in the conclusions. They also perform k-fold cross-validation and analyze variations such as subject grouping (factorization) and one-line vs. multi-line formatting, which adds depth to the study.
- Reproducibility and Data Sharing: The study is supported by a well-organized public repository on GitHub with a detailed README. It includes the data preparation scripts, configuration files for each model/format, and even links to trained models and logs on Weights & Biases. This openness greatly facilitates understanding and replication of the experiments. The use of an established DBpedia dump as the data source, together with the provided SHACL shape, ensures that the task is clearly defined and repeatable.
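Regarding the validity and F1 metrics mentioned above: the paper and repository define the exact procedure, so the following is only a minimal sketch, assuming a set-based exact match over parsed triples, of how a syntactic validity check and triple-level F1 could be computed with rdflib.

```python
from rdflib import Graph

def parse_triples(turtle_text: str):
    """Parse generated text as Turtle; return the triple set, or None if syntactically invalid."""
    g = Graph()
    try:
        g.parse(data=turtle_text, format="turtle")
    except Exception:
        return None
    return {(str(s), str(p), str(o)) for s, p, o in g}

def triple_f1(pred_text: str, gold_text: str):
    """Triple-level precision/recall/F1 under exact matching; invalid or empty predictions score 0."""
    pred, gold = parse_triples(pred_text), parse_triples(gold_text)
    if not pred or not gold:
        return 0.0, 0.0, 0.0
    tp = len(pred & gold)
    precision, recall = tp / len(pred), tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```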
Weaknesses
- The paper does not explain why only encoder–decoder models were used. Decoder-only transformer models have become dominant in recent years (especially since GPT-style models), with many open-weight options available for fine-tuning (e.g., LLaMA, Falcon). Their exclusion is left unjustified, which is surprising given that decoder-only LLMs have demonstrated strong performance on structured generation tasks.
- While the study focuses on structured output formatting, it does not address how the presence of Wikipedia in pretraining data—given DBpedia’s origin—might have influenced the models' ability to reproduce correct outputs, independent of learning the syntax itself.
- Scope and Generality: The evaluation is restricted to datatype properties of Person entities in DBpedia. This narrow focus limits the generality of the conclusions – it remains unclear whether the observed trends hold for object properties or other knowledge domains. The paper itself notes that future work will extend to object properties, underlining that the current study addresses only a specific subset of the KG extraction problem.
- Incremental Contribution: While thorough, the work builds upon existing approaches (e.g., it fine-tunes off-the-shelf models and a REBEL-based pipeline). The concept of linearizing triples for sequence-to-sequence models is not entirely new (cf. REBEL [1]). The main novelty lies in comparing many syntax variations, which is useful but arguably an empirical extension of prior ideas. Some might view the contribution as a large-scale evaluation rather than a fundamentally new method or theory.
- Comparative Baselines: The study could be strengthened by including additional baselines or points of reference. For example, it does not report results for larger decoder-only LLMs (such as GPT-3/GPT-4) on the task – even zero-shot or few-shot – which would indicate the performance gap between small fine-tuned models and large prompted models. Similarly, comparing against a non-generative baseline (e.g., a BERT-based classification or slot-filling approach for the same properties) would help quantify the absolute performance of the proposed approach. Without such comparisons, the reader is left to assume that the fine-tuned generative models are state-of-the-art; some context on how close they come to human or baseline performance would add value.
- Writing Quality: Overall the paper is clearly written and structured, but there are minor issues. In a few places, dense technical descriptions and notation (e.g. in the linearization notation or metric definitions) could be explained more intuitively for readability. The term “small language models” is used to describe 220M+ parameter models without explicit clarification – this could be misleading, and a brief explanation that “small” is relative to models like GPT-3 would help. These are minor style quibbles; the manuscript is otherwise well-organized with a good flow from problem setup through results.
Comments
- What is the motivation for choosing only encoder–decoder models? The paper briefly mentions scalability challenges with decoder-only models, but that does not explain excluding them from full fine-tuning – several decoder-only models have been adapted to extraction tasks. I recommend discussing this decision in more detail in the revision.
- Page 9, lines 26–27: The sentence about Wikipedia pretraining is unclear. It first suggests models weren’t trained on Wikipedia, then says they were. Note: The Pile includes English Wikipedia (Gao et al., 2020).
- The models were likely pretrained on Wikipedia, which overlaps with the DBpedia-based task. Although the study centers on evaluating output formatting rather than factual correctness per se, some performance gains may stem from content memorization rather than syntax learning. A more controlled analysis – e.g., testing on synthetic or out-of-domain entities – could help disentangle formatting generalization from content familiarity (a sketch of such a probe is given after this comment list).
- The title “1. Introduction: Targeted Datatype Properties Extraction” seems unconventional. I recommend using simply “1. Introduction” to align with standard practice. The same applies to “2. Related Work”—I suggest sticking to the conventional title without additional text. If needed, the authors can indicate themes within the section using bold or italicized phrases at the start of paragraphs. The same comment applies to other section headers, such as “3. Methodological Framework: Definitions and Notations”—“3. Methodological Framework” is sufficient, especially since the subsections already indicate the specific content. In general, I recommend avoiding colons in main section titles. While occasionally justified—e.g., when providing an expanded form of an acronym—avoiding them improves consistency with common conventions and makes papers easier to parse programmatically. Unusual title formats often lead to misprocessing in downstream tasks.
- Data Repository Evaluation: The GitHub repository (https://github.com/datalogism/12ShadesOfRDFSyntax) is well organized and includes a clear README with syntax examples. All necessary resources for replication—code, configurations, SHACL shape, and model outputs—are provided. GitHub is suitable, though archiving on Zenodo is recommended for long-term accessibility. The artifacts appear complete and support reproducibility.
- Related Work: The related work coverage is adequate; however, to strengthen it further, the authors could cite concurrent research such as Text2KGBench [2] or the work by Frey et al., "How Well Do LLMs Speak Turtle?" [3], as these reinforce the timeliness of evaluating LLMs for structured output generation.
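To illustrate the controlled analysis suggested in the comment on Wikipedia pretraining above (purely hypothetical, not something the paper does): one could build a probe set of fictitious Person entities that cannot occur in any pretraining corpus, pair short textual descriptions with target triples in each linearization, and compare performance against the DBpedia-derived test set; comparable scores would suggest the models learned the output syntax rather than recalled memorized content. The entity names, the example.org resource IRIs, and the Turtle-style target below are all invented for illustration and may differ from the paper's exact formats.

```python
import random

# Invented name fragments, chosen so the resulting entities cannot appear in pretraining data.
GIVEN = ["Zorblat", "Quinnevra", "Maxelion", "Tirvalda"]
FAMILY = ["Vantrosk", "Ollibrand", "Feruneth", "Skarmody"]

def synthetic_example(rng: random.Random) -> dict:
    """Return one input sentence paired with a Turtle-style target triple (illustrative format only)."""
    name = f"{rng.choice(GIVEN)} {rng.choice(FAMILY)}"
    birth_date = f"{rng.randint(1900, 1999)}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}"
    resource = name.replace(" ", "_")
    target = (f"<http://example.org/resource/{resource}> "
              f"<http://dbpedia.org/ontology/birthDate> "
              f"\"{birth_date}\"^^<http://www.w3.org/2001/XMLSchema#date> .")
    return {"input": f"{name} was born on {birth_date}.", "target": target}

rng = random.Random(13)
probe_set = [synthetic_example(rng) for _ in range(100)]
```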
References
1. Cabot, P. L. H., & Navigli, R. (2021, November). REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 2370–2381).
2. Mihindukulasooriya, N., Tiwari, S., Enguix, C. F., & Lata, K. (2023, October). Text2KGBench: A benchmark for ontology-driven knowledge graph generation from text. In International Semantic Web Conference (pp. 247–265). Cham: Springer Nature Switzerland.
3. Frey, J., Meyer, L. P., Arndt, N., Brei, F., & Bulert, K. (2023). Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle?