Review Comment:
Overall
The article “Question Answering with Deep Neural Networks for Semi-Structured Heterogeneous Genealogical Knowledge Graphs” fits the call-for-papers topics “Knowledge-Driven NLP for Digital Humanities” as well as “Machine Learning for Knowledge Graphs in Digital Humanities”.
Originality: The work is original as far as I can tell and could open a new path for downstream NLP tasks.
Significance of the results: For genealogy, this model provides an important stepping stone not only for question answering but for neural network-based NLP tasks in the domain in general (provided the model becomes available).
Quality of writing: The paper is well-written and easy to follow, for both CS and digital humanities readers.
Long-term stable URL for resources assessment:
The paper presents a novel DNN model for QA over genealogical data. Neither the raw GEDCOM data, nor the KG, nor the Gen-SQuAD data, nor the fine-tuned QA model is publicly available; none of them can be assessed against the criteria below, as they are protected under the European General Data Protection Regulation (GDPR) and the Israeli Protection of Privacy Regulations. The model, the vocabulary, the tokenizer configuration, the special-characters mapping, and other configurations are available to the reviewers, and we trust the authors to publish them after acceptance. This allows replicability in the sense that users could apply the presented model to their own dataset and compute numbers; however, the numbers from the paper cannot be reproduced. The assessment of the long-term stable URL (which is currently not long-term stable) is: (A) the folder is clean but contains no README and thus no instructions on how to load the data; a sample Python script for loading the data together with a toy GEDCOM tree would be nice, and a license file is missing. (B) no, see above. (C) no. (D) cannot be checked due to (A).
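For instance, such a README could include a minimal, dependency-free loading sketch along these lines (the tags follow the GEDCOM 5.5 standard; the toy tree and the helper names are invented for illustration, not taken from the authors' data):

```python
# Minimal, dependency-free sketch of loading a toy GEDCOM tree.
# Tags (INDI, NAME, FAMS, ...) follow the GEDCOM 5.5 standard;
# the tree itself is invented for illustration.

TOY_GEDCOM = """\
0 @I1@ INDI
1 NAME John /Doe/
1 FAMS @F1@
0 @I2@ INDI
1 NAME Jane /Doe/
1 FAMS @F1@
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
"""

def parse_gedcom(text):
    """Parse GEDCOM lines of the form '<level> [<xref>] <tag> [<value>]'
    into a list of (level, xref_or_None, tag, value) tuples."""
    records = []
    for line in text.splitlines():
        parts = line.split(" ", 2)
        level = int(parts[0])
        if parts[1].startswith("@"):        # record header, e.g. '0 @I1@ INDI'
            xref, tag, value = parts[1], parts[2], ""
        else:                               # sub-record, e.g. '1 NAME John /Doe/'
            xref, tag = None, parts[1]
            value = parts[2] if len(parts) > 2 else ""
        records.append((level, xref, tag, value))
    return records

def individual_names(records):
    """Collect the NAME line of each INDI record."""
    names, current = {}, None
    for level, xref, tag, value in records:
        if level == 0:
            current = xref if tag == "INDI" else None
        elif current and tag == "NAME":
            names[current] = value
    return names

records = parse_gedcom(TOY_GEDCOM)
print(individual_names(records))  # {'@I1@': 'John /Doe/', '@I2@': 'Jane /Doe/'}
```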
(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,
(B) whether the provided resources appear to be complete for replication of experiments, and if not, why,
(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and
(D) whether the provided data artifacts are complete.
Review
The authors propose an end-to-end QA approach for genealogy, a first for the field. The authors use existing semi-structured data (RDF + full text) and convert it into a form that is suitable for machine-reading-comprehension algorithms.
## Introduction
The introduction reads well and allows laypersons to get familiar with the problem at hand. The motivation is clear.
The example on page 2 might need revision, as the main search engines answer with a number, not a list (tested with Google, Siri, and Bing). Maybe use a less familiar person.
The structure of the contributions could be improved. Currently, there are two contributions, one examination contribution, and three research questions. It would be easier to follow if the contributions were also formatted as an itemized list aligned with the research questions.
## Related Work
The related work section is convincing and covers all standard literature for DNNs as well as QA. It is also a good read for beginners, as it introduces all main concepts in detail. Figures one, four, and five help to understand the standard as well as the proposed QA pipeline.
Figures two and three are misplaced in the sense that the paragraph needed to interpret them only comes on the next page.
Section 2.2 could improve cross-linking of the topic by introducing synonyms for the field, such as Machine Reading Comprehension or Open-Domain Question Answering. It is also not clear whether the first paragraph of 2.2, which explains how artificial neural networks work in general, is needed. Given the interdisciplinary scope of this work, it may be necessary or superfluous depending on the reader.
“slow performance of DNNS” should be rephrased: the issue is not the slow performance of DNNs per se, but the sheer number of comparisons required if every indexed text were compared against a query.
The numbering of the components of DNN systems is inconsistent and could be improved by introducing letters as second-level items.
The domain vocabulary used could be better aligned: either there are static “embeddings” and contextual “embeddings”, or static “representations” and dynamic “representations” (later “vectors”), but not a mixture.
The description of the final layer of the DNN is rough and not entirely correct; compare page 6, right column, in https://aclanthology.org/N19-1423.pdf and rework the paragraph to match the description of said final layer. The authors then describe the fine-tuning process in sufficient detail on page 16; a forward pointer there would keep the interested reader from puzzling over the details.
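For reference, the span-prediction head in Devlin et al. reduces to two learned vectors S and E: a token's start (resp. end) logit is the dot product of its final hidden state with S (resp. E), and the predicted span maximizes the sum of a start and a later end logit. A sketch with invented toy numbers:

```python
import math

def span_logits(hidden_states, S, E):
    """BERT QA head (Devlin et al., p. 6): the start/end logit of token i
    is the dot product of its final hidden state T_i with a learned
    start vector S (resp. end vector E)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    start = [dot(t, S) for t in hidden_states]
    end = [dot(t, E) for t in hidden_states]
    return start, end

def best_span(start, end, max_len=10):
    """Predicted span = argmax over i <= j of start[i] + end[j]."""
    best, best_score = (0, 0), -math.inf
    for i in range(len(start)):
        for j in range(i, min(i + max_len, len(end))):
            if start[i] + end[j] > best_score:
                best_score, best = start[i] + end[j], (i, j)
    return best

# Toy example: 4 tokens with 2-dim hidden states (numbers invented).
T = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.0, 0.1]]
S, E = [1.0, 0.0], [0.0, 1.0]
start, end = span_logits(T, S, E)
print(best_span(start, end))  # (1, 2): span starts at token 1, ends at token 2
```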
Footnote 7 misses a citation to Johnson, Jeff, Matthijs Douze, and Hervé Jégou. "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data (2019) as the state of the art in indexing vectors which is used for retrieval in DNNs.
On page 7, left column, the authors could improve the clarity on which graph is meant when talking about GNNs. In particular, the term knowledge graphs should be introduced beforehand given that the reader might be unfamiliar.
For the generation of text from a KG, recent works that do not require extensive training data, such as Moussallem, Diego, et al. "NABU–Multilingual Graph-Based Neural RDF Verbalizer." International Semantic Web Conference. Springer, Cham, 2020, are missing.
Finally, while there are citations for standard methods such as LSTM and attention, the authors fail to provide a citation for knowledge-graph-based question generation.
The authors could improve the description of Figure 5 to help understand whether this is the proposed architecture. This does not become clear throughout the section.
## Method
The method chapter is easy to read and follow. The authors do a good job at describing details where needed despite some missing forward references, see below.
It is unclear why CIDOC-CRM was chosen. A discussion of why other ontologies were not used would help computer scientists understand that choice; there also seems to be a variety of RDF-based GEDCOM ontologies and vocabularies available. The modeling (Fig. 9) seems to make verbalization hard, i.e., the E67 node appears as a blank node (it would be a qualifier in Wikidata). So readers may wonder why CIDOC was chosen.
There are networks like REFORMER (see also https://ai.googleblog.com/2020/03/fast-and-easy-infinitely-wide-networks...) that can handle much longer input sequences.
Please explain why parents are considered second-degree relations. Depending on the chosen ontology, e.g., DBpedia, this differs.
The formatting on page 12 makes it hard to follow the text flow. It would be better to place Figure 8 at the top of the column.
Even after studying the pseudocode on this page, it is unclear why the algorithm stops. It would be helpful to provide an intuition here, since NQ keeps getting elements enqueued.
If it is correct that the questions were paraphrased using [46], it would be good to state so explicitly; if not, please clarify what is meant on page 14, left column, top: “multiple variations…”.
Is SP’s grandfather Alexander omitted from Figure 9 on purpose? Adding this information would make the example figure more valuable.
It would also be good to see an example of a multihop template in Table 1. Or does this paragraph mean that the model picks up answering multihop questions on its own? It surely does not.
What influence does the order of verbalized sentences have? Did you do experiments on it?
Also, an example of a WH question from the DNN would be interesting to see. Did you evaluate the quality of the generated questions? Do errors in generation (e.g., wrong grammar) influence the model?
Which BERT model exactly did you use? Can you provide a pointer to the base model, e.g., on the Hugging Face hub?
Pre-trained static node embeddings can be used in Transformer architectures and their descendants, see He, Bin, et al. "Integrating Graph Contextualized Knowledge into Pre-trained Language Models." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 2020. Thus, it would be good to rephrase the hint on page 16 right column.
Figure 11 has “Selecting a model” on an arc. What exactly is a model in this context?
Overall, the last paragraph describes the way a user would use the system. Initially, one could think the proposed approach also selects the correct GEDCOM tree from a database of trees but apparently, the user does. Thus, the system has an easier task in terms of retrieval than normal KG QA systems. It would be good to clarify that at the beginning of the section.
## Experimental design
The experimental design section is also well-written and easy to follow. The only unclear part is the explanation of the maximum sequence length (in tokens). It would be good for non-experts if the authors could explain with an example how this was handled: in particular, how the window was used if the question or the answer span fell outside the chosen window, and how that interacts with the learning.
## Results
The results section is quite easy to understand and up to par. An ablation study was performed on the input parameter (degree).
There are two questions left:
How does the system deal with a question about knowledge that does not exist in the tree?
For place questions, can it be that the Uncle-BERT_2 model always tries to pick relations that are two hops away due to its training, whereas Uncle-BERT_1 can find the 1-hop-away places? This could be verified with a simple table, computed from the answers selected by each model and how many hops away they are.
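A sketch of how such a table could be computed (the family graph, the model names' predictions, and all numbers here are invented for illustration):

```python
from collections import Counter, deque

def hops(graph, src, dst):
    """BFS distance between two nodes in an undirected family graph,
    or None if dst is unreachable from src."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None

# Toy family graph and toy model predictions (both invented).
graph = {"SP": ["father"],
         "father": ["SP", "grandfather"],
         "grandfather": ["father"]}
predictions = {"Uncle-BERT_1": ["father", "father"],
               "Uncle-BERT_2": ["grandfather", "father"]}

# Tally, per model, how many hops away its selected answers lie.
table = {model: Counter(hops(graph, "SP", ans) for ans in answers)
         for model, answers in predictions.items()}
for model, counts in sorted(table.items()):
    print(model, dict(counts))
```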
Minor Issues
Arxiv citations: Some arXiv citations used by the authors (e.g., [92]) have meanwhile been published in peer-reviewed venues. The suggestion is therefore to update these references, e.g., via a tool such as https://twitter.com/billyuchenlin/status/1353850378438070272
“Training of the DNN” => “Training of a DNN”
“While using DNNs for the open-domain question answering task has” => “have”
Citations [112] to [115] seem to be out of order
The claim “optimal DNN-based question answering pipeline” should be revisited.