Question Answering with Deep Neural Networks for Semi-Structured Heterogeneous Genealogical Knowledge Graphs

Tracking #: 2925-4139

Omri Suissa
Maayan Zhitomirsky-Geffet
Avshalom Elmalech

Responsible editor: 
Special Issue Cultural Heritage 2021

Submission type: 
Full Paper
With the rising popularity of user-generated genealogical family trees, new genealogical information systems have been developed. State-of-the-art natural question answering algorithms use deep neural network (DNN) architectures based on self-attention networks. However, some of these models use sequence-based inputs and are not suited to graph-based structures, while graph-based DNN models rely on a level of knowledge-graph comprehensiveness that does not exist in the genealogical domain. Moreover, these supervised DNN models require training datasets that are absent in the genealogical domain. This study proposes an end-to-end approach for question answering over genealogical family trees by: 1) representing genealogical data as knowledge graphs, 2) converting them to texts, 3) combining them with unstructured texts, and 4) training a transformer-based question answering model. To evaluate the need for a dedicated approach, a comparison was performed between the fine-tuned model (Uncle-BERT), trained on the auto-generated genealogical dataset, and state-of-the-art question answering models. The findings indicate that there are significant differences between answering genealogical questions and open-domain questions. Moreover, the proposed methodology reduces complexity while increasing accuracy and may have practical implications for genealogical research and real-world projects, making genealogical data accessible to experts as well as the general public.
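The four-step pipeline described in the abstract can be sketched roughly as follows. The triple schema, the verbalization templates, and all names below are illustrative assumptions for clarity; they are not the authors' actual implementation.

```python
# Rough sketch of the abstract's pipeline: (1) genealogical records as
# knowledge-graph triples, (2) triples verbalized into text, (3) merged with
# unstructured notes, (4) packaged as a QA training record. The schema,
# templates, and names are invented for illustration only.

# Step 1: a tiny family tree as (subject, relation, object) triples.
triples = [
    ("Ada", "childOf", "Anne"),
    ("Ada", "bornIn", "1815"),
    ("Anne", "spouseOf", "George"),
]

# Step 2: verbalize each triple with a hand-written template per relation.
templates = {
    "childOf": "{s} is a child of {o}.",
    "bornIn": "{s} was born in {o}.",
    "spouseOf": "{s} is the spouse of {o}.",
}

def verbalize(triples):
    """Turn graph triples into a plain-text passage."""
    return " ".join(templates[r].format(s=s, o=o) for s, r, o in triples)

# Step 3: combine the generated text with unstructured biographical notes
# to form the context a transformer QA model would be fine-tuned on.
notes = "Anne was known in her town as a skilled mathematician."
context = verbalize(triples) + " " + notes

# Step 4: one SQuAD-style training record (questions are auto-generated
# in the paper; this one is hand-written for the sketch).
record = {
    "context": context,
    "question": "Who is Ada's mother?",
    "answer": "Anne",
}
print(record["context"])
```

A fine-tuned extractive QA model would then be trained to locate the answer span ("Anne") inside the combined context.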


Solicited Reviews:
Review #1
By Ricardo Usbeck submitted on 22/Nov/2021
Review Comment:

Thanks to the authors for providing a color-coded paper and a thorough cover letter. I really appreciate your effort! The paper has been significantly improved and thus will hopefully achieve a significant impact in the DH community.

# Long-term stable URL for resources assessment:
The paper presents a novel DNN model for QA over genealogical data. Neither the raw GEDCOM data, nor the KG, nor the Gen-SQuAD data, nor the fine-tuned QA model is publicly available, due to GDPR constraints, as the authors explain in their cover letter. None of the criteria below can be assessed, since the data is protected under the European General Data Protection Regulation (GDPR) and Israeli Protection of Privacy Regulations. The model, the vocabulary, tokenizer config, special-character mapping, and other configurations are available to the reviewers.
The new version of the paper allows replicability in the sense that users would need to use the presented model and dataset to calculate the numbers. Still, the numbers from the paper cannot be reproduced, but that is not a big issue in this case.
The new assessment of the long-term stable URL (which is currently not long-term stable) is: (A) a README, example file, and example code are available; (B) it depends, see above; (C) yes; (D) given the paper, yes. Overall, the authors did a good job of enhancing the resource material.

# Review
The authors propose an end-to-end QA approach for the field of genealogy, a first in the field. The authors use existing semi-structured data (RDF+full-text) and convert it into a form that is suitable for machine-reading/comprehension algorithms.

## Introduction
The introduction reads well and allows laypersons to get familiar with the problem at hand. The motivation and the contributions are clear.

## Related Work
The related work section is convincing and covers all standard literature for DNNs as well as QA. It is also a good read for beginners, as it introduces all main concepts in detail. Figures one, four, and five help to understand the standard as well as the proposed QA pipeline.

## Method
The new version of the chapter reads really well!

## Experimental design
The experimental design section is also well-written and easy to follow.

## Results
The results section is quite easy to understand and up to par. An ablation study was performed on the input parameter (degree).

Review #2
By Isaiah Onando Mulang' submitted on 22/Nov/2021
Review Comment:

The authors have taken time to respond to the comments from the first reviews. I view the rebuttal as sufficient to make the paper acceptable. Specifically, it was important to clearly understand what more the paper offers besides fine-tuning of BERT and data generation. The discussion on the contextual formulation of data from the domain and the consideration of context size in the model (1-hop, 2-hop, n-hop away context) provides insight not only into the value of context selection, but also into the methodology of using context in constrained or limited scenarios such as the fine-tuning approach. It leaves open the question, for further experimentation, of how to efficiently incorporate context in models.

The claim that the Uncle-BERT model reduces complexity while increasing accuracy is substantiated to a good extent (at least as far as the comparison with Uncle-DELFT2 is concerned). It would, however, be more impactful to provide more baselines.

Overall, the approach is not absolutely novel; however, the motivation and data justify my choice to accept the paper.