Abstract:
With the adoption of the Findable, Accessible, Interoperable, and Reusable (FAIR) principles for data by researchers, an increasing number of datasets have been made available online, supporting research investigations. To ease dataset interoperability, the scientific community has proposed the I-ADOPT framework as a means to capture the subtleties and nuances of scientific variables in a structured manner. However, creating machine-readable variable representations requires significant expertise and manual effort, given the wealth of variable types in use by different communities. In this paper, we explore the use of Large Language Models (LLMs) to assist with this manual step. We propose the I-ADOPT benchmark, an expert-annotated corpus and task designed to measure the performance of LLMs at the different stages of automatically creating a machine-readable scientific variable. Our corpus includes more than 100 scientific variables represented as structured knowledge graphs, and our results show that even large models (32B parameters) struggle to create these representations accurately (F1 score below 50%).