Abstract:
Vast community-driven knowledge graphs (KGs), like Wikidata, are the primary reference data for Entity Linking (EL) applications. However, they exhibit significant coverage bias towards information that is widely popular on the Web, leading to underrepresentation of long-tail entities, particularly from non-contemporary contexts. Concurrently, ongoing mass digitisation of cultural heritage resources reveals numerous named entities and associated knowledge currently missing from general-purpose KGs. Enriching such KGs with these ``NIL'' entities presents an opportunity to enhance their completeness and mitigate biases, such as gender disparity in the representation of historical figures.
In this article, we investigate an approach based on retrieval-augmented generative AI to capture information about NIL entities and generate structured KGs suitable for integration into Wikidata.
The approach is applied to the case of persons unknown to Wikidata but mentioned in a collection of 19th-century musical periodicals. We empirically select 6 properties used in Wikidata for entities of that type and create a manually annotated NIL-entities KG as a gold standard for evaluation.
Through comprehensive experiments, we evaluate 6 state-of-the-art Large Language Models (LLMs) from different vendors, combined with 6 different state-of-the-art retrievers.
Our results demonstrate significant variations in performance across model-retriever combinations, with high accuracy for gender identification and family name, promising results for occupation and country of citizenship, and low accuracy for date of birth.
We report on a detailed error analysis and discuss the potential of our approach for mitigating historical bias in Wikidata.