Review Comment:
The title of the paper clearly illustrates the research topic, the scientific areas under investigation (Machine Translation, Historical Research), and the case study (Aramaic-Ancient Hebrew Translations).
The abstract is well written and well structured; it summarizes the objectives of the study, the methods adopted, the dataset, the main findings of the research, and suggestions for future research.
The statement of the problem, the main objectives, and the contribution of this study to MT research are correct and adequately described in the Introduction. The research contributes to the advancement of corpus linguistics, parallel corpus construction and analysis, and statistical machine translation. It provides a fascinating case study: the construction of an Aramaic-Hebrew parallel corpus based on the translation of the Bible, using the Corpus Encoding Standard. This corpus is an interesting case for cultural heritage recovery and preservation.
In the Background, the authors exhaustively discuss the scientific literature on Statistical Machine Translation (SMT) approaches and models proposed so far, with a focus on word alignment models (i.e. word-based translation models), such as the IBM models; symmetrization, with an in-depth explanation of Och and Ney’s methods and of other models (phrase extraction from alignment data generated by IBM models, Zhang et al.’s algorithm, etc.); and decoding algorithms. Although this study focuses on SMT, the authors also provide a broad overview of, and some suggested readings on, Neural Machine Translation (NMT) approaches, since these show promising results in the field of MT. The choice of the SMT method in this study is well defended in the paper (p. 5): it is justified by the characteristics of the case study and by the modular nature of SMT. The section contains two other subsections, where the authors provide a documented overview of MT applied to ancient languages (2.3) and a complete review of Aramaic NLP studies, which is connected to one of the objectives of the study, being “a crucial step in preserving the Aramaic language and culture heritage” (p. 2). Thus, this extensive section collects and critically discusses the problems and solutions, methods and approaches adopted in MT research to date, in line with the goals of the study.
Section 3 focuses on the parallel corpus built by the authors using three Aramaic-Hebrew corpora: Targum Onkelos, Targum Jonathan, and Zohar. The authors adequately explain the characteristics of each corpus, providing interesting historical information about their composition, authorship, and impact. This is also useful for readers who are not familiar with this textual tradition. The construction of the Aramaic-Hebrew parallel Bible corpus is described in 3.1, where there is a brief mention of the encoding format adopted (CES) and of the level 1 annotation guidelines. Beyond its consistency with the corpus of Christodoulopoulos and Steedman (2015), it would be interesting to know whether and how this type of encoding is suitable for SMT in this specific case.
The Evaluation section discusses the main findings of the study: the assessment of the quality of the parallel corpus and the performance of the SMT algorithm trained on this corpus and on other similar Aramaic texts. The method adopted to train the SMT algorithms is correctly described; the authors accurately explain the division of the dataset into three sets, the evaluation measure (BLEU), and the MT algorithms adopted (statistical and neural), which use open-source toolkits applied in previous work. They also explain that the evaluation does not take into consideration intelligibility, word order, or grammatical consistency; since the paper provides an interesting case for the assessment of SMT approaches, it would be interesting to know why this is so.
The evaluation results (Table 2) show that the SMT approach outperforms the NMT one and that the BLEU score of the SMT method adopted is relatively high, thus confirming previous studies on the word-by-word translation techniques applied in the two ancient Aramaic translations considered (Targum Onkelos and Targum Jonathan). The authors analyze the translation errors of the SMT method adopted, using a random selection of 30 sentences; the results are classified into five categories, which are also discussed from a qualitative perspective, with some interesting examples. Moreover, they evaluate the performance of the trained SMT model on the third parallel corpus, the Zohar, providing an accurate explanation of the lower BLEU score obtained in this case. This is further discussed in the paper, considering also the translation of the Talmud, which poses specific challenges to SMT.
The Conclusions are correct: the authors summarize the main findings and suggest new approaches for future research, such as using monolingual data or comparable corpora to improve SMT performance, and exploring the translation quality of other ancient language pairs by applying this same methodology. Although some possible limitations of the study have been mentioned in the main sections, I suggest recapping them in the Conclusions too. Also, since the authors assert that NLP may be useful for preserving cultural heritage and endangered languages, it would be interesting to develop this idea a little more.
References are complete, relevant, and up to date.
FORMAL ASPECTS
The structure of the article is correct, and the style is adequate. However, I suggest correcting some typos and revising some stylistic aspects:
Page 2 line 2: word > work
Page 2 lines 14-18: The CES is only mentioned here and on pp. 7-8; please provide some more information about this encoding standard here.
Page 2 lines 32-33: The systems are then able to translate previously unseen sentences.
Page 3 lines 6-10: please reformulate; it is not clear what the connection between these two sentences is: "translation systems produce alignments between source and target sentences" ... "However, data available ... only contains sentence pairs".
Page 3 line 221: The one-to-many mapping; please briefly explain this concept
Page 3 line 40: world > word
Page 3 line 47: fertility model; please briefly explain this concept
Page 3 lines 49-51: revise syntax
Page 4 line 40: multiplied with > multiplied by: The cost of a new state is the cost of the original state multiplied by the translation, distortion…
Page 5 line 12: BLEU points; BLEU is mentioned here but only explained on p. 8; it may be useful to give some indication of it here too
Page 5 line 18: theory; I'd rather say 'the assumption' or 'the theoretical assumption'
Page 5 line 39: While > Though,
Page 6 lines 22-23: the language > this language
Page 6 line 34: since > Since
Page 7 lines 1-8: … the central work of Jewish spiritual literature, also known as Kabbalah. The Zohar is a series of books with a commentary on the spiritual elements and scriptural interpretations of the Pentateuch, as well as mysticism, mythical cosmogony, and mystical psychology. The Zohar scriptural exegesis can be viewed as an esoteric…
Page 7 lines 37-43: please revise syntax
Page 8 lines 32-33: was trained on our parallel corpus, and on other Aramaic text
Page 10 line 23: some sentence > some sentences
The paper can be published after minor changes.