Designing information models for the etymological dictionary of Silesian geographical names

Tracking #: 2423-3637

Authors: 
Tomasz Kubik

Responsible editor: 
Eero Hyvonen

Submission type: 
Full Paper
Abstract: 
This article aims at contributing to the methodology of the structuring of etymological dictionaries of geographical names and popularization of knowledge regarding the origin of Silesian toponyms. It is based on experiences gathered during the digitization and publication in an electronic form of Słownik Etymologiczny Nazw Geograficznych Śląska, SENGŚ (‘the etymological dictionary of the geographical names of Silesia’) and addresses the problems encountered. The article discusses the rules applied in the compilation of SENGŚ and presents two information models used during the digitalization of this dictionary: a relational model and a graph model. The first one corresponds to standard approaches when designing electronic versions of dictionaries. The second allows creation of solutions conforming to the idea of Linked Open Data, which are deployable as parts of the Semantic Internet. An important aspect also considered was the linking of historical materials listed in the dictionary entries with the corresponding records maintained in digital repositories. This association was realized using the Atlas Zasobów Otwartej Nauki, AZON (‘Atlas of Open Science Resources’), platform.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By John McCrae submitted on 26/May/2020
Suggestion:
Reject
Review Comment:

This manuscript presents the digitisation of an etymological dictionary and some discussion of its conversion to RDF. The paper seems to be quite out of scope for this journal and would be more appropriate in a dedicated digital humanities journal. Further, it has been submitted as a 'full paper', instead of using the option of a 'dataset description' in this journal. As such, my review reflects this submission as a full paper in SWJ.

The resource described, while very interesting and deep, is also in a sense very narrow, focusing on the etymology of place names in only a (small) part of Poland. There needs to be much more motivation about why this resource is valuable and why a general reader of SWJ would be interested in Silesian place names. There is a very detailed description of the history and purpose of the dictionary, but I don't feel this relates to a computer science audience.

The main technical contribution of the paper starts in Section 3 and can be seen as mostly a process of creating a normalized database schema. This schema is then mapped to a graph model, but there is no evidence that this has been published as linked data or that any attempt has been made to follow semantic web best practices in terms of interoperability, linking, open data, metadata, etc. The authors provide the RDFS axioms and discuss them, but this should be familiar to readers of SWJ already. There is no real new methodology, nor any attempt at an evaluation of the methodology, that one would expect from a full paper in this journal.

Minor issues:
p1.
24 "Semantic Internet" => "Semantic Web"
32 The abstract is just "abstract text"
43 "role in *this*"
37 "between *a* dictionary"
p2.
25 dictionaries *has* evolved
29 flexibility of solutions -built-
13 prior in their case => "ago"?
p4.
47 year? year?
p6.
22 *the* German language
p7.
5 WordNet 3.1 was released before OntoLex-Lemon; the RDF export is based on OntoLex-Lemon
p9.
10 remove 'designed'
p12.
8 AbstracToponym

Review #2
By Raquel Liceras Garrido submitted on 21/Jul/2020
Suggestion:
Minor Revision
Review Comment:

This paper takes an original approach to producing a digital version of a textual dictionary in two formats, web and graph. The significance of the results lies in the fact that the author has produced new resources that can be accessed and reused by other people, in addition to sharing a pipeline that could be used and adapted for similar projects. The writing style is clear, understandable and easy to follow.
My comments for improvement relate to the dates of the data, the figures and the source digitisation process. Firstly, although section 2.1 details the dates of the dictionary compilation and the dates of the entries, it would be advisable to include a short reference in the introduction to place the reader in the time frame. Figures 1, 2, 3 and 6 must be larger; some are impossible or hard to read. The size should be the width of the page. Finally, regarding the digitisation of the source, although in the last paragraph of the paper the author points out that describing the process would require additional space, I think a couple of sentences mentioning whether the source was handwritten or typewritten and how it was transformed into a machine-readable format (via OCR, machine learning, etc.) would fill a gap between the original dictionary format and the new digital versions presented in this paper.

Review #3
By Frank Abromeit submitted on 31/Jul/2020
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Summary:
The author describes the creation of an electronic version of 'The Etymological Dictionary of the Geographical Names of Silesia', an etymological dictionary of Silesian geographical names (toponyms). The dictionary is based on historical data going as far back as the 15th century. It is therefore a promising language resource for tracing the etymology of words, e.g. phonetic and morphological changes in several languages, such as Silesian, Polish, German, Czech, Lusatian (Upper and Lower) and prehistoric languages (Celtic, Illyrian), as mentioned in the text. Metadata with references to documentation and etymological descriptions is also reported to be available and can be linked to the data. The resulting dataset is presented online at http://sengs.e-science.pl. Unfortunately, no English version of the website is available.

Evaluation:
After presenting the history and content of the SENGS lexicon data in great detail, the article describes the data modeling processes that led to a relational as well as a linked data model for representing the lexicon data. The author goes into the foundations of data modeling with an extensive description of the capabilities and advantages of RDF graph database models over traditional RDBMSs. However, less is said about why certain design decisions were made, for example, why a proprietary linked data model was used instead of already established ontologies for modeling language resources, especially lexicon data, e.g. OntoLex-Lemon (https://www.w3.org/2016/05/ontolex/). Similarly, the GeoNames dataset (http://www.geonames.org), which is related to the topic of the paper, is not discussed. In addition, I would have liked to see some examples of real modeled linked data, as the ontology presented in the diagram (p.11) is incomplete, e.g. AbstractToponym and ComparedToponym have no RDF properties. Finally, the relation between the relational and the linked data model is somewhat unclear to me. Is the content in both models identical, and why was a relational data model required in the first place?

Considering the linked data model itself, I have the following questions:

1) In the SENGS ontology diagram (p.11) I don't understand why there are two classes, Toponym and ToponymClass. Is this a typo?

2) Among the features I missed were the modeling of certainty and of hydronyms, oronyms and microtoponyms, which were reported to be present in the lexicon data and which could certainly lead to more use cases for the SENGS ontology.

3) Wouldn't it be more efficient to model the comparison of toponyms with a property instead of using a Comparison class? This depends, however, on what other properties are included with that class. This could not be verified because the SENGS ontology is not freely available.

Other questions I would have liked to be answered:
1) Will the linked data version of the dataset, as well as the SENGS ontology, be publicly available?

Quality of writing:
The paper is well written and structured. As the text is in manuscript state, further proofreading is required to remove typographical errors.

Some formal issues:
p1, snd. col. 14 VocabularIES
p2, snd. col. 17 with THE linguistic data cloud
p3, fst. col. 31 ARE presented
p3, snd. col. 30 The development of 'A' common onomastic method in Poland and Germany confirmed it.
p5, col.1 46 The sixteenth volume closes the alphabetical range of THE vocabulary
- URL for cited resource is missing: AZON (‘Atlas of Open Science Resources’)

Review #4
Anonymous submitted on 01/Sep/2020
Suggestion:
Reject
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper presents work related to creating SENGS, an etymological dictionary of Silesian geographical names.

First, an introduction to different kinds of dictionaries, vocabularies, gazetteers, etc. is given.

In section 2, the scope, history, and content of SENGS entries are described. SENGS is clearly a result of substantial work on collecting toponyms and onomastic research.

The topic of section 3 is model design. The section is very detailed. On some occasions the writing is difficult to understand for a reader who does not know Polish, as the explanations make heavy use of Polish examples.

A detailed ER model of the dictionary is presented in Fig. 6.

It seems that the model derives directly from the structure of the dictionary. It is not clear how generalizable the model is, i.e. how useful it is outside the SENGS context and in relation to Semantic Web research. I do not see much methodological novelty or contribution here regarding Semantic Web technologies, which in my mind would be needed in this journal.

A graph model has also been developed, which relates the work more closely to the ideas underlying the focus of SWJ. There is a fairly long presentation of RDFS semantics and reasoning, and the model contains 17 classes (Fig. 7). What remains unclear is how the reasoning is related to the use case, i.e., what the benefits of the graph model are to the end user, the data publisher, and the application developer. In my mind, the presentation focuses too much on documenting the technical details of the model. RDFS itself is known to the readers of this journal.

In the Summary, the author states that both models were successfully used for transforming and applying the data in a prototype. However, it is not clear what the benefits of each approach were, as there is no comparison or evaluation.

Originality

There is originality in the paper regarding the SENGS data and case, but not from a Semantic Web point of view. Even though a complex RDFS-based model is presented, the novelty of the approach is not clear. Related work is not discussed much, only some general models. It is not clear what the lessons learned are. The application domain of etymological dictionaries seems original, but how original the current research is within this field needs more clarification.

Significance of results

The results are significant regarding the SENGS dictionary, but how generalizable and significant the models are for other related dictionaries needs more discussion. The paper has a very narrow focus on the Polish data, which is not so interesting to the readers of this journal.

Quality of writing

The text is well written, in polished English.

Although the paper is well written and based on substantial work where RDFS is applied, I would not recommend this paper for publication in this journal but somewhere else (in a DH or humanities journal?), because the semantic web contributions are not strong and the application focus is so narrow.

Some minor comments:

Fig. 1 is too small to be readable and it has not been explained in the text. How is the prototype actually used, and why is it a significant step forward for the end user?

Capitalize all "fig. N" references in the text, e.g., fig. 4 -> Fig. 4

Figs. 6 and 7 are too small to be readable.