Abstract:
The discovery of new drugs from natural products depends on the efficient extraction of biochemical knowledge from the scientific literature. Recent studies have introduced several enhancements to the NatUKE benchmark, improving the performance of knowledge graph embedding methods. These enhancements include refined PDF text extraction, named entity recognition, and improved embedding techniques. Notably, some approaches have incorporated large language models (LLMs). Building on these advances, this study investigates the fine-tuning of LLMs combined with similarity search, considering both open-source and proprietary models for the automatic extraction of biochemical properties. We fine-tune the LLMs with similarity search to mitigate textual inconsistencies and enhance the prediction of five target properties: compound name, bioactivity, species, collection site, and isolation type. Experimental results demonstrate that similarity search consistently improves performance and that open-source models can be competitive, occasionally outperforming proprietary ones. We also find that the effectiveness of fine-tuning varies across models and biochemical properties. Overall, our findings highlight the potential of LLMs, particularly when fine-tuned and augmented with similarity search, as powerful tools for accelerating the extraction of biochemical knowledge from scientific texts.