Review Comment:
Review of Glottocodes: Identifiers Linking Families, Languages and Dialects by Robert Forkel and Harald Hammarström
Reviewer instructions:
"This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool."
Recommendation: 1. Accept
This paper is a submission to the 'Tools and Systems Report' track, and it is my opinion that it succinctly describes the system at hand and that the data the system produces follow the FAIR principles. Furthermore, Glottolog's impact is clear from the broad and increasing use of its glottocodes in the linguistics literature and from the thousands of citations to its data in scientific publications.
Review:
In this paper, the authors describe glottocodes, an identification system for cataloging and linking languages, dialects, and language families in the Glottolog, the most comprehensive bibliographic database about the world's languages.
At the heart of Glottolog's design decisions and technical infrastructure is the notion of using glottocodes to encode "languoids", a cover term for any language entity, including registers, dialects, languages, and genealogical and areal groupings.
This level-neutral term allows the Glottolog editors to capture the language-dialect distinction (a continuum rather than a categorical divide) and to describe the language situation and status of the world's many minority, poorly documented, and low-resource languages.
By adopting a "doculect"-based approach to language classification (a doculect is a languoid as described in a particular published source), the editors are able to encode concrete attestations of languages in a fluid manner and to keep our understanding of the world's languages up to date as they are increasingly documented by linguists.
The authors describe the motivation for introducing glottocodes in 2010, their history, and the problems they aimed to address in light of the already existing and well-established International Organization for Standardization (ISO) 639-3 three-letter language identifiers.
Previous versions of ISO (ISO 639-1, 639-2) assigned two-letter alphabetic codes, e.g., "en" for English, to represent the world's major languages. However, this set of 26^2 = 676 letter combinations could not encode the 7,000+ languages of the world.
Thus, in the early 2000s ISO invited SIL International to prepare ISO 639-3 by integrating its comprehensive 'Ethnologue: Languages of the World' three-letter language name identifiers, which were long-established and had been continuously compiled and edited for decades:
"Each language is given a three-letter code on the order of international airport codes. This aids in equating languages across national boundaries, where the same language may be called by different names, and in distinguishing different languages called by the same name. (Grimes 1974:i)"
https://www.ethnologue.com/about/history-ethnologue
However, the Ethnologue three-letter codes -- now formally ISO 639-3 with SIL International as its register authority -- are purposefully designed as language-level identifiers.
A hotly debated issue in linguistics is the question of when two linguistic entities are different languages or are dialects of the same language. These arguments play out in the scientific literature and the Glottolog approach of assigning glottocodes to languoids based on doculects is one practical way to address this issue. (Note that ISO 639-3 does not provide bibliographic attestations for its list of languages and that the Glottolog developers maintain a mapping between language-level glottocodes and ISO 639-3 codes with reasoning when they do not agree.)
The design and dissemination of glottocodes and their metadata follows the FAIR principles for scientific data. The Glottolog data are published in the Cross-Linguistic Data Formats formal specification:
https://cldf.clld.org/
which is based on W3C's CSVW specification and allows for easy conversion to JSON-LD:
https://en.wikipedia.org/wiki/JSON-LD
The data are provided in a transparent format with an associated semantic ontology, described in detail here [1]:
https://www.nature.com/articles/sdata2018205
This makes it possible to derive, for example, alternative RDF data formats for initiatives such as the Linguistic Linked Open Data Cloud:
https://linguistic-lod.org/llod-cloud
Furthermore, the CLDF specification provides a practical delineation between tools and data. This has afforded the development of numerous third-party tools and services for data transformation and analysis in the open-source community, as documented in numerous scientific publications that use Glottolog data.
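To make this delineation between tools and data concrete, here is a minimal sketch of how a CLDF-style languages table can be consumed with nothing but the standard library. The column names and rows are illustrative only; the actual Glottolog CLDF release defines its tables and columns in the accompanying CSVW metadata.

```python
import csv
import io

# Illustrative sample in the shape of a CLDF languages table.
# Real releases declare these columns in their CSVW metadata file.
sample = """\
ID,Name,Glottocode,ISO639P3code,Level
stan1293,English,stan1293,eng,language
some1234,Some English Dialect,some1234,,dialect
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Map glottocodes to ISO 639-3 codes; dialect-level languoids
# typically have no ISO code, which the mapping preserves as None.
iso_by_glottocode = {r["Glottocode"]: r["ISO639P3code"] or None for r in rows}
print(iso_by_glottocode)
```

Because the data are plain CSV plus declarative metadata, any third-party tool that speaks CSVW can process them without depending on Glottolog-specific software.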
One limitation of the paper is that the authors write that the glottocode system has "numerous NLP applications", but they do not explicitly mention any. Perhaps this is because they are so obvious, but I will mention one that I find very important and practical.
The level of granularity provided by the glottocode system, and the fact that it is based on scientific attestations of languoids, makes it an ideal system for adhering to the so-called Bender Rule:
https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-...
In short, in NLP research one should always explicitly name the language(s) one is working on. This assists the field in engaging with, among other things, ethical issues such as exclusion, underexposure, and overgeneralization [2]:
https://www.aclweb.org/anthology/P16-2096.pdf
Consider for example the evaluation of an AI system trained on English. The Glottolog provides more than 100 different glottocodes for the various English dialects, so the precise varieties can be identified and evaluated.
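As a sketch of this practice, an NLP dataset card could record the precise variety per split using glottocodes, with a simple well-formedness check. The regular expression below reflects the commonly described glottocode shape (four lowercase letters or digits followed by four digits); it checks format only, not whether a code actually exists in Glottolog, and the dataset names and the dialect code are hypothetical.

```python
import re

# Commonly described glottocode shape: four lowercase alphanumeric
# characters followed by four digits, e.g. "stan1293". Format check
# only; existence must be verified against Glottolog itself.
GLOTTOCODE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")

def is_glottocode(code: str) -> bool:
    return bool(GLOTTOCODE.match(code))

# Hypothetical dataset card: name the variety, not just "English".
dataset_languages = {
    "train": "stan1293",  # Standard English (illustrative)
    "eval": "dial1234",   # a hypothetical dialect-level code
}

assert all(is_glottocode(c) for c in dataset_languages.values())
assert not is_glottocode("eng")  # an ISO 639-3 code is not a glottocode
```

Such per-split annotation makes explicit exactly which varieties a system was trained and evaluated on, which is the point of the Bender Rule.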
[1] Forkel, R., List, JM., Greenhill, S. et al. 2018. Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Sci Data 5, 180205. https://doi.org/10.1038/sdata.2018.205
[2] Hovy, D., & Spruit, S. L. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-2096
The paper is clearly written. Here are some editorial suggestions for the authors:
* "this does not establish lock-in" --> what is "lock-in"?
* Please represent large numbers with commas and not spaces, e.g. 25 695 --> 25,695
* "make use of git a distributed" --> "make use of git, a distributed"
* "see 3.3" --> "see Section 3.3" (or "see §3.3" or whatever the SWJ stipulates)
* add spaces when needed before references, e.g. HTTP([15]) --> HTTP ([15])
* "Glottolog data is serialized as CLDF Structure Dataset ([17])" --> as a
* "Thanks to last-decade improvements" --> last-decades is weird (to me)
* This link in the PDF does not work: https://www.wikidata.org/wiki/Q31746
* I would add colons before bulleted lists, e.g., "languoids as follows" --> languoids as follows:
* Why is "Diversity Linguistics" capitalized?
* Don't need the 'cf' when footnoting URLs, e.g., Cf. https://semver.org/
* "code - i.e." --> either: code, i.e. | code -- i.e. (extra dash in LaTeX)
* In the text URLs do not resolve when they cross lines, e.g., https://glottolog.org/files/ 1 glottolog-4.0/awun1244.html -- maybe best to put all the URLs in footnotes or in the references