Abstract:
Integrating Schema.org markup into web pages has generated billions of RDF triples. However, around 75% of web
pages still lack this critical markup. Large Language Models (LLMs)
present a promising solution by automatically generating the missing
Schema.org markup. Despite this potential, there is currently no
benchmark to evaluate the markup quality produced by LLMs.
This paper introduces LLM4Schema.org, an innovative approach for assessing
the performance of LLMs in generating Schema.org markup. Unlike
traditional methods, LLM4Schema.org does not require a predefined ground
truth. Instead, it compares the quality of LLM-generated markup
against human-generated markup.
Our findings reveal that 40–50% of the markup produced by GPT-3.5
and GPT-4 is invalid, non-factual, or non-compliant with the
Schema.org ontology. These errors underscore the limitations of LLMs
in adhering strictly to structured ontologies like Schema.org
without additional filtering and validation mechanisms.
We demonstrate that specialized LLM-powered agents can effectively
identify and eliminate these errors. After applying such filtering
to both human- and LLM-generated markup, GPT-4 shows notable
improvements in quality and outperforms humans.
LLM4Schema.org highlights both the potential and challenges of leveraging
LLMs for semantic annotations, emphasizing the critical role of
careful curation and validation in achieving reliable results.