Glottocodes: Identifiers Linking Families, Languages and Dialects

Tracking #: 2685-3899

Authors: 
Harald Hammarström
Robert Forkel

Responsible editor: 
Guest Editors Advancements in Linguistics Linked Data 2021

Submission type: 
Tool/System Report
Abstract: 
Glottocodes constitute the backbone identification system for the language, dialect and family inventory Glottolog (https://glottolog.org). In this paper, we summarize the motivation and history behind the system of glottocodes and describe the principles and practices of data curation, technical infrastructure and update/version-tracking systematics. Since our understanding of the target domain --- the dialects, languages and language families of the entire world --- is continually evolving, changes and updates are relatively common. The resulting data is assessed in terms of the FAIR (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship. As such the glottocode-system responds to an important challenge in the realm of Linguistic Linked Data with numerous NLP applications.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Jeff Good submitted on 04/Apr/2021
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

Summary review
This paper provides an overview of the linked data on languoids that is available through Glottolog. Glottolog is becoming an increasingly important resource in linguistics, especially since Ethnologue has been put behind a paywall, and, in my view, it definitely merits treatment in the Semantic Web Journal since it is a resource of potential value to any linked data project where language identification is important. On the whole, the paper provides a good overview of the resource from a linked data perspective, and I believe it merits publication. Some aspects of the paper's readability could be improved, and I think it would be helpful for certain aspects of the presentation to be revised following my comments below. However, since these should all be straightforward to address, I see these as constituting minor revisions.

In terms of the two main review criteria: (1) I think there is no question of the quality, importance, and impact of Glottolog. (I see that this paper was submitted in the "Tools/Systems" category, and I wonder if it fits better in the "Dataset" category, but, either way, it's an important resource.) The paper itself demonstrates this through citation counts, but I can independently say, from personal experience, that Glottolog has had very wide uptake in the linguistics community. (2) The paper does a good job of describing the capabilities of Glottolog, and of demonstrating that it meets the criteria laid out in the FAIR principles. I think its clarity can be improved, as indicated above, but I see my comments as representing relatively minor adjustments to what is already a high-quality manuscript.

General

-Throughout the paper, I noticed minor typographic and language issues. Examples include the use of a space in a long number ("25 695"), where English conventions would normally use a comma, and a phrase like "An ID specifically devoted to machine treatment", which would be more idiomatic as "An ID specifically designed for machine readability". I don't think these impact the readability of the paper in any significant way, but, if the paper is accepted, it would be good for it to be carefully proofread for such issues.

-For a reader not familiar with this area, I think it would be good to add more context about the ISO 639 family of codes and, especially, ISO 639-3. I also think that the paper inaccurately conflates ISO 639-3 and the Ethnologue. These are obviously historically connected, and it is the case that SIL International is both the ISO 639-3 Registration Authority and the publisher of the Ethnologue. But these are different resources, and my understanding is that the only information that is technically part of the ISO 639-3 codeset are the codes themselves and the language names. (That is, this is the "normative" information in the standard.) In addition, the Registration Authority can provide "informative" information, which is not, strictly speaking, part of the standard. The Registration Authority's current practice (as seen on its website) is to provide some basic metadata about a code (e.g., whether it is Active or not and whether it denotes a Living language), and then, informatively, it provides links to resources offering more detailed denotations of the code. These include links to Ethnologue and Glottolog (as well as Multitree and Wikipedia). I don't think all of these details need to be discussed in this paper, but I do think that it's important to not conflate ISO 639-3 and the Ethnologue. Also, I think the fact that the ISO 639-3 Registration Authority includes links to Glottolog can be viewed as an endorsement of its quality and impact, since this means the Registration Authority is, in effect, endorsing Glottolog as a good source for information on the denotation of the 639-3 codes. Finally, it may be worth pointing out in this paper that Ethnologue is now paywalled, but Glottolog is not, which makes Glottolog even more valuable than when it was first created. (The ISO 639-3 code tables are still open, of course.)

-Based on the content of this paper, the relationship between the languoid catalog and the references section (i.e., 'langdoc') is not entirely clear to me. I think the paper could be improved by clarifying this. Is this section conceptualized as part of Glottolog? Or, is it a separate resource? This paper seems to be primarily about the languoid catalog (which is completely reasonable), but that could be clarified, perhaps, for a reader who is new to this area. One thing I noticed about the current website is that the bibliographic resources do not seem to be associated with the same serialization capabilities as the languoids. For instance, when looking at the page for a reference, there is no RDF output of it. There is an RDF output of a list of references, but these outputs are different from the language catalog outputs. Again, I don't see this as a problem with respect to the resource but, rather, think the paper could make the scope of "Glottolog" as a dataset more explicit.

More detailed remarks:

Title: The title of the paper implies that it is about Glottocodes, but the content seems to be about both Glottocodes and some features of Glottolog. It might be nice to clarify this in the title or the text.

p.2: Can something be said about how the Glottolog editors are chosen? Is the process for selecting editors formalized at all?

p.3: "In ISO 639-3, each entry has metadata such as geographical information, name(s), speaker numbers and classification which presumably defines the language." This is a place where I think ISO 639-3 and Ethnologue are inappropriately conflated. This information is found in Ethnologue but not the ISO 639-3 reference tables (which, instead, link to Ethnologue).

p.3: It might be helpful to list out the key domain-specific ontologies and resources that are used in the Glottolog data. A common ontology like SKOS probably does not need to be mentioned, but it looks like GOLD and Lexvo are used, and it is probably worth referencing them.

p.4, "which in turn provide links to other language identification schemes such as ISO 639-2": Readers unfamiliar with language code standards may benefit from a brief indication of how 639-2 differs from 639-3.

p.4, "helps researchers in Diversity Linguistics": I don't think the term Diversity Linguistics is widely used yet and is not ideal for a general Semantic Web audience. It may be clearer to write something like, "descriptive and comparative linguistics" or, perhaps, "descriptive and typological linguistics". (If the authors want to include reference to Diversity Linguistics, they could perhaps use the label, but then briefly define it and perhaps add a relevant reference.)

p.54, §4: I found some of the discussion in this section confusing. Here are some questions that I had: (i) Are the codes for entities that are not "assertable L1 languages" part of the Glottolog resource that is the focus of this paper, or are they "ancillary" information that is not seen as part of the core resource? More concretely, of the entities that can be browsed on the Glottolog website, to which do FAIR standards consistently apply? (ii) It is written that, "Glottolog makes a classification decision for all and only language-level languoids". However, aren't all dialects classified as belonging to a language-level languoid? Are there "floating" dialects? Similarly, aren't all higher-level languoids ultimately grouped into a set of top-level families augmented by a small set of top-level non-family entities? In any case, since Glottolog publishes classification decisions about the level of the language, I have trouble understanding this statement. (iii) Finally, I didn't really understand how "protection" works, especially when a language is promoted/demoted to a family or a dialect. Since it was a "language", doesn't this code need to be tracked as though it were once a language? Overall, this was the part of the paper that I had the most trouble understanding. So, it may be a good section to target for revisions to improve clarity.

Review #2
By Menzo Windhouwer submitted on 13/May/2021
Suggestion:
Accept
Review Comment:

Glottocodes is an important pivot dataset for linguistic information on the web, both on the scale of the web at large, e.g., Wikipedia and Wikidata, and for more specific communities, i.e., the Linguistic Linked Open Data community. This paper gives a clear overview of the setup of Glottocodes and looks at it from the perspective of the various FAIR principles. By doing so, it is a prime example of what can be achieved with careful usage of the state-of-the-art building blocks of the web, i.e., HTTP, JSON-LD and git(hub), to publish and maintain a valuable living resource in the FAIR way.

Review #3
By Steven Moran submitted on 13/May/2021
Suggestion:
Accept
Review Comment:

Review of Glottocodes: Identifiers Linking Families, Languages and Dialects by Robert Forkel and Harald Hammarström

Reviewer instructions:

"This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool."

Recommendation: 1. Accept

This paper is a submission in the 'Tools and Systems Report' category, and it is my opinion that it succinctly describes the system at hand and that the data the system produces follow the FAIR principles. Furthermore, Glottolog's impact is clear from the broad and increasing use of its glottocodes in the linguistics literature and from the thousands of citations to its data in scientific publications.

Review:

In this paper, the authors describe glottocodes, an identification system for cataloging and linking languages, dialects, and language families in the Glottolog, the most comprehensive bibliographic database about the world's languages.

At the heart of Glottolog's design decisions and technical infrastructure is the notion of using glottocodes to encode "languoids", a cover term for any language entity, including registers, dialects, languages, and genealogical and areal groupings.

This level-neutral term allows the Glottolog editors to capture the language-dialect distinction (it is a continuum rather than categorical) and to describe the language situation and status of the world's many minority, poorly documented, and low-resource languages.

By adopting a "doculect"-based approach to language classification, where a doculect is a languoid that has been described in published sources, the editors are able to encode concrete attestations of languages in a fluid manner and to keep our understanding of the world's languages up to date as they are increasingly documented by linguists.

The authors describe the motivation and history of why glottocodes were introduced in 2010 and what problems they aimed to address in light of the already existing and well-established International Organization for Standardization's (ISO) 639-3 three-letter language name identifiers.

Earlier parts of the standard covered only the world's major languages: ISO 639-1 assigns two-letter codes, e.g., "en" for English, and ISO 639-2 assigns three-letter codes to a few hundred languages. The 26^2 two-letter combinations alone could not encode the 7000+ languages in the world.
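
The capacity argument can be checked with a few lines of Python. This is only a sketch: the figure of 7,000 is the approximate language count mentioned above, and the variable names are illustrative.

```python
# Number of letters in the basic Latin alphabet used for the codes.
ALPHABET_SIZE = 26

# Approximate number of languages, taken from the "7000+" figure in the text.
APPROX_LANGUAGES = 7000

# Distinct codes available with two and with three letters.
two_letter_codes = ALPHABET_SIZE ** 2
three_letter_codes = ALPHABET_SIZE ** 3

print(f"two-letter codes:   {two_letter_codes}")
print(f"three-letter codes: {three_letter_codes}")
print("two letters suffice:  ", two_letter_codes >= APPROX_LANGUAGES)
print("three letters suffice:", three_letter_codes >= APPROX_LANGUAGES)
```

Two letters yield 676 codes and three letters yield 17,576, so only the three-letter space is large enough.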

Thus, in the early 2000s ISO invited SIL International to prepare ISO 639-3 by integrating its comprehensive 'Ethnologue: Languages of the World' three-letter language name identifiers, which were long-established and had been continuously compiled and edited for decades:

"Each language is given a three-letter code on the order of international airport codes. This aids in equating languages across national boundaries, where the same language may be called by different names, and in distinguishing different languages called by the same name. (Grimes 1974:i)"

https://www.ethnologue.com/about/history-ethnologue

However, the Ethnologue three-letter codes -- now formally ISO 639-3 with SIL International as its registration authority -- are purposefully designed as language-level identifiers.

A hotly debated issue in linguistics is the question of when two linguistic entities are different languages or are dialects of the same language. These arguments play out in the scientific literature and the Glottolog approach of assigning glottocodes to languoids based on doculects is one practical way to address this issue. (Note that ISO 639-3 does not provide bibliographic attestations for its list of languages and that the Glottolog developers maintain a mapping between language-level glottocodes and ISO 639-3 codes with reasoning when they do not agree.)

The design and dissemination of glottocodes and their metadata follow the FAIR principles for scientific data. The Glottolog data are published in the Cross-Linguistic Data Formats formal specification:

https://cldf.clld.org/

which is based on W3C's CSVW specification and allows for easy conversion to JSON-LD:

https://en.wikipedia.org/wiki/JSON-LD
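
To make the idea concrete, here is a minimal sketch of how one row of a CSV table could be mapped to a JSON-LD node. It does not implement the CSVW conversion algorithm, and it is not Glottolog's own export. The sample rows, the column names, the ontology namespace and the URL pattern are all assumptions made for illustration and should be checked against the CLDF specification and the Glottolog site.

```python
import csv
import io
import json

# Assumption: namespace of the CLDF ontology terms (verify against the spec).
CLDF_NS = "http://cldf.clld.org/v1.0/terms.rdf#"
# Assumption: URL pattern for languoid pages on the Glottolog site.
GLOTTOLOG_BASE = "https://glottolog.org/resource/languoid/id/"

# Hypothetical excerpt in the style of a language table; values are illustrative.
SAMPLE_CSV = (
    "ID,Name,Glottocode\n"
    "stan1293,English,stan1293\n"
    "awun1244,Awun,awun1244\n"
)


def row_to_jsonld(row):
    """Turn one CSV row (a dict keyed by column name) into a JSON-LD node."""
    return {
        "@context": {"cldf": CLDF_NS},
        "@id": GLOTTOLOG_BASE + row["Glottocode"],
        "cldf:name": row["Name"],
        "cldf:glottocode": row["Glottocode"],
    }


def csv_to_jsonld(text):
    """Parse CSV text and return a list with one JSON-LD node per row."""
    return [row_to_jsonld(row) for row in csv.DictReader(io.StringIO(text))]


if __name__ == "__main__":
    print(json.dumps(csv_to_jsonld(SAMPLE_CSV), indent=2))
```

A real conversion would read the column-to-property mapping from the CSVW metadata file that accompanies the tables rather than hard-coding it as done here.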

The data are provided in a transparent format with an associated semantic ontology, described in detail here [1]:

https://www.nature.com/articles/sdata2018205

This makes it possible to derive, for example, alternative RDF data formats for initiatives such as the Linguistic Linked Open Data Cloud:

https://linguistic-lod.org/llod-cloud

Furthermore, the CLDF specification provides a practical delineation between tools and data. This has afforded the development of numerous third-party tools and services for data transformation and analysis in the open-source community, as documented in numerous scientific publications that use Glottolog data.

One limitation of the paper is that the authors write that the glottocode-system has "numerous NLP applications", but they do not explicitly mention any. Perhaps it is because they are so obvious, but I will mention one that I find very important and practical.

The level of granularity provided by the glottocode system, and the fact that it is based on scientific attestations of languoids, mean that it is an ideal system for adhering to the so-called Bender Rule:

https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-...

In short, in NLP research one should always explicitly name the language(s) one is working on. This assists the field in engaging with, among other things, ethical issues such as exclusion, underexposure, and overgeneralization [2]:

https://www.aclweb.org/anthology/P16-2096.pdf

Consider for example the evaluation of an AI system trained on English. The Glottolog provides more than 100 different glottocodes for the various English dialects, so the precise varieties can be identified and evaluated.

[1] Forkel, R., List, JM., Greenhill, S. et al. 2018. Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Sci Data 5, 180205. https://doi.org/10.1038/sdata.2018.205

[2] Hovy, D., & Spruit, S. L. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-2096

The paper is clearly written. Here are some editorial suggestions for the authors:

* "this does not establish lock-in" --> what is "lock-in"?

* Please represent large numbers with commas and not spaces, e.g. 25 695 --> 25,695

* "make use of git a distributed" --> "make use of git, a distributed"

* "see 3.3" --> "see Section 3.3" (or "see §3.3" or whatever the SWJ stipulates)

* add spaces when needed before references, e.g. HTTP([15]) --> HTTP ([15])

* "Glottolog data is serialized as CLDF Structure Dataset ([17])" --> as a

* "Thanks to last-decade improvements" --> last-decade is weird (to me)

* This link in the PDF does not work: https://www.wikidata.org/wiki/Q31746

* I would add colons before bulleted lists, e.g., "languoids as follows" --> languoids as follows:

* Why is "Diversity Linguistics" capitalized?

* Don't need the 'cf' when footnoting URLs, e.g., Cf. https://semver.org/

* "code - i.e." --> either: code, i.e. | code -- i.e. (extra dash in LaTeX)

* In the text URLs do not resolve when they cross lines, e.g., https://glottolog.org/files/ 1 glottolog-4.0/awun1244.html -- maybe best to put all the URLs in footnotes or in the references

Review #4
Anonymous submitted on 25/Jun/2021
Suggestion:
Reject
Review Comment:

This paper describes Glottolog (https://glottolog.org), a system for the language, dialect and family inventory. The problem addressed is important for numerous applications, e.g., in linguistics and history. The presentation of the platform is clear, and important issues of data curation, reusability and sustainability are presented. A git repository is used and the data is shared in an open and FAIR way. The main mechanisms for linking data within the platform are the unique IDs attached to each glottocode.

The Semantic Web technological support is, however, missing. This could ensure that a search for a certain language will be successful independent of its spelling (e.g., the generally accepted spelling for classical Ethiopian is Ge'ez; however, this is not found, and one has to write Geez). Some classifications of languages are strange. For example, the entire family of South Semitic languages is missing. It is confusing to have, e.g., a family "Eastern Romanian" as a subcategory of "Northern Romanian". At least the labels of the classes/categories should be revised. The GIS-like presentation is in my opinion misleading because it ties the spread of the language to an unclear area. For an endangered or nearly extinct language it is not clear whether it is spoken only exactly in the place where the point is situated, in a larger area, or in isolated places across the entire country.
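
A spelling-insensitive name lookup of the kind requested here could be sketched as follows. This is a hypothetical illustration and not part of Glottolog: the function names, the index and the record label are invented, and only the Ge'ez/Geez example comes from the comment above.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Reduce a language name to a comparison key.

    Decomposes accented characters, drops everything that is not a
    letter or digit (apostrophes, spaces, combining marks), and
    lower-cases the result.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ch.isalnum()).casefold()


# Hypothetical index from normalized names to records.
INDEX = {normalize_name("Geez"): "record-for-geez"}


def lookup(query: str):
    """Return the record whose normalized name matches the query, if any."""
    return INDEX.get(normalize_name(query))


if __name__ == "__main__":
    for query in ("Geez", "Ge'ez", "ge\u2019ez"):
        print(query, "->", lookup(query))
```

With such a key, the spellings "Geez" and "Ge'ez" resolve to the same entry.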