From Scientific Variables to Knowledge Graphs: The I-ADOPT Benchmark

Tracking #: 4005-5219

This paper is currently under review
Authors: 
Barbara Magagna
Arvin Rastegar
Esteban González Guardia
Cristian Berrio
Stuart Chalk
Jose Manuel Gomez-Perez
Christof Lorenz
Saurav Kumar
Daniel Garijo

Responsible editor: 
Guest Editors ML and KR 2025

Submission type: 
Full Paper
Abstract: 
With the adoption of the Findable, Accessible, Interoperable and Reusable (FAIR) principles for data by researchers, an increasing number of datasets have been made available online, supporting research investigations. To ease dataset interoperability, the I-ADOPT framework has been proposed by the scientific community as a means to capture the subtleties and nuances of scientific variables in a structured manner. However, creating machine-readable variable representations requires significant expertise and manual effort, given the wealth of variable types in use across different communities. In this paper we explore the use of Large Language Models (LLMs) to help automate this manual step. We propose the I-ADOPT benchmark, an expert-annotated corpus and task designed to measure the performance of LLMs at the different stages of automatically creating a machine-readable scientific variable. Our corpus includes more than 100 scientific variables represented as structured knowledge graphs, and our results show that even large models (32B parameters) struggle to create these representations accurately (< 50% F1 score).
Tags: 
Under Review