OLiA -- Ontologies of Linguistic Annotation

Tracking #: 409-1523

Authors: 
Christian Chiarcos

Responsible editor: 
Guest editors Multilingual Linked Open Data 2012

Submission type: 
Ontology Description
Abstract: 
This paper describes the Ontologies of Linguistic Annotation (OLiA) as one of the data sets currently available as part of the Linguistic Linked Open Data (LLOD) cloud. The OLiA ontologies represent a repository of annotation terminology for various linguistic phenomena across a wide range of languages. They have been used to facilitate interoperability and information integration of linguistic annotations in corpora, NLP pipelines, and lexical-semantic resources.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Menzo Windhouwer submitted on 28/Jan/2013
Suggestion:
Accept
Review Comment:

This paper contains a well-written description of the OLiA architecture, which can function as a pivot for crosswalks between annotation schemes, tagsets or term bases.

(1) Quality and relevance

Section 3 (Data Set Description) and Section 4 (Applications) show that the OLiA ontologies broadly cover the various knowledge bases constructed in the domain of linguistics and can be successfully applied to documentation and semantic interoperability tasks.

(2) Clarity and completeness of the descriptions

I found the paper clearly written and have only some minor comments to improve the paper:

* In Section 2 or 3, add some small examples of the various mappings, e.g., how does olia:Determiner get linked to a tag set or term base?
* Section 5, "... become more obvious to ISOcat developers, but these ...": although the developers are concerned, in the case of an open registry like ISOcat it is especially the community that populates and uses the registry that should become more aware; maybe "... obvious to ISOcat users, ..."

Some additional comments:

* The TDS ontology is now publicly available under a CC-BY license at http://languagelink.let.uu.nl/tds/ontology/LinguisticOntology.owl

Review #2
Anonymous submitted on 12/Feb/2013
Suggestion:
Minor Revision
Review Comment:

The authors describe an interesting data set for the linguistic domain that can be used as an intermediate layer between different annotation schemes, as well as existing terminology repositories.

The paper is clearly written and the data set itself is well motivated. Its advantages are made clear throughout the article. The design principles, for example using OWL DL as an expressive modeling language, are understandable and seem feasible.

Existing data sets in the domain are compared comprehensively, pros and cons are well discussed.

The data set was developed and extended over several years, resulting in good quality. It was not merely created as one more data set alongside existing ones, but also provides alignments to them, which indeed makes it possible to use the OLiA Reference Model as an intermediate representation.

The relevance and usefulness are, on the one hand, given by the integration in the growing LLOD cloud and by being part of the NIF standard, and, on the other hand, shown by applications, e.g. applying several NLP algorithms and integrating the heterogeneous results easily and correctly.

I would suggest a minor revision now and, if the authors fix the minor issues below, accepting the paper afterwards.

Minor comments:

On page 2:
* There are two different values for the number of MorphosyntacticFeatures (16 and 4), so I guess the latter refers to MorphologicalFeature.
Additionally, if the reference model is loaded into, e.g., Protege, and the taxonomy view is correct, then there are 17 subclasses of MorphosyntacticFeature, compared to the 16 mentioned by the authors.

On page 3:
* Reference 29 is followed by a comment in the same brackets. The comment also refers to another data set, so it should be placed in separate brackets.
* "In a similar vein, OLiA can be employed in NLP pipeline systems and other NLP pipeline systems for tagset-independent, interoperable information processing[1]." The phrase "NLP pipeline systems" appears twice in the same sentence, which sounds scary.

On page 4:
* "...minimizing the number of mappings necessary establish interoperability of one ..." There is a word ("to") missing between "necessary" and "establish".
* I'm not sure if it is useful to introduce the notations (| and +) of reference 24, because they are used nowhere in the rest of the text.
* The class expression (gold:Quantifier \sqcap \neg gold:Determiner) breaks the layout.
* What is owl:join? This is definitely not part of OWL. Do the authors mean the owl:unionOf construct? This should be fixed or clarified.

Review #3
By Steve Cassidy submitted on 03/Apr/2013
Suggestion:
Minor Revision
Review Comment:

This paper describes the Ontologies of Linguistic Annotation, which provide an OWL ontology describing terms used in the annotation of language resources for linguistic research. Overall, the presentation of the ontology is good and the ontology itself provides a useful resource in the context of a number of projects that are building RDF representations of linguistic annotations. The paper falls down a little in describing the motivation for developing this OWL version of the existing GOLD and ISOcat vocabularies and in discussing how this ontology might be more sustainable than the earlier standards.

The following are some rough notes on the paper:

The introduction doesn't really motivate the problem sufficiently; it is not clear what problem OLiA is solving.

There are already a few 'standard' vocabularies in this domain which are widely used; GOLD and ISOcat are cited. We're told that neither of these is suitable but not really why. OLiA is introduced as a third vocabulary that somehow engineers compatibility for applications with both of these other vocabularies and possibly more. But it's not clear how OLiA improves on the situation, adding another standard that needs to be maintained with no idea how it will be better than the 'community based' processes referenced for the other two. Clearly the author is in a good position to make changes for his work, but how does this help a third party who now needs to decide whether to petition GOLD, ISOcat or OLiA to make a change or include a new concept?

In Section 4 there's an example based on a query system. The author doesn't really explain the difference between the cases which are just querying for strings ("NX", "NP") and the OLiA case which is querying for a namespaced RDF term. In general there's not enough discussion of why RDF is being used, or of the fact that the vocabularies being referenced are not RDF-based.

In Section 4:
"employed in NLP pipeline systems and other NLP pipeline systems for tagset-independent, interoperable information processing"

The second "NLP pipeline systems" is redundant. Also, this example is not explained at all: how is OLiA used in this context, and are there alternatives?

Fig. 1 is slightly confusing. The concept PDAT (stts:PDAT from the text) is shown as instance_of stts:AttributiveDemonstrativePronoun, which is_a olia:DemonstrativeDeterminer, but in the text both of these relationships are described as superconcept/subconcept, which suggests is_a to me. Also, PDAT is drawn as an ellipse whereas the higher classes are rectangles, suggesting abstract classes. Are these higher classes conceptually different from PDAT? Could they have a hasTag relation, for example?

Also, there is a suggestion here of a process for disambiguating a tag used in annotation by matching the string representation "PDAT" to a literal in the ontology. Clearly it could be the case that two distinct concepts from different vocabularies have the same string tag (this is the point of using namespaced vocabularies, after all). So, is there another part to this process (presumably another input is the vocabulary that has been used for annotation)? Is the annotation actually stored as a string tag rather than as a reference to stts:PDAT?
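
The ambiguity in question can be illustrated with a minimal sketch, assuming a hypothetical lookup table keyed by annotation scheme and tag string. Only the stts entry follows the example in Fig. 1; the scheme "other", its concept name, and the function `resolve_tag` are invented for illustration and are not part of OLiA.

```python
# Hypothetical lookup: (annotation scheme, tag string) -> namespaced concept.
# Only the stts entry reflects the example from Fig. 1; the "other" scheme
# and its concept are invented placeholders.
TAG_INDEX = {
    ("stts", "PDAT"): "stts:AttributiveDemonstrativePronoun",
    ("other", "PDAT"): "other:SomeUnrelatedConcept",
}


def resolve_tag(tag, scheme=None):
    """Return every concept a tag string could denote.

    Without a scheme, all vocabularies that use the string are candidates;
    with a scheme, only that vocabulary is consulted.
    """
    return {
        concept
        for (known_scheme, known_tag), concept in TAG_INDEX.items()
        if known_tag == tag and (scheme is None or known_scheme == scheme)
    }


ambiguous = resolve_tag("PDAT")                  # bare string: two candidates
unambiguous = resolve_tag("PDAT", scheme="stts")  # scheme given: one candidate
```

In this sketch the bare string alone cannot be resolved, which is why the annotation scheme would have to be a second input to any such matching process.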

End of p. 3: "One application are ensemble combination architectures," - also, it's not clear this is a good way to explore this kind of algorithm. POS taggers are generally trained on tagged data and generate the same tags on new data. One could re-train any tagger on uniformly tagged data to get that kind of output and then combine the different taggers in parallel as described. Doing it the way the paper suggests means you don't need to re-train the taggers; apart from avoiding this extra work, is there a real benefit here?
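
For concreteness, combining taggers without re-training might look like the following minimal sketch, which assumes that each tagger's output is first mapped to a shared concept and then put to a majority vote. Only the STTS PDAT entry follows Fig. 1; the tagsets "tagset_b" and "tagset_c", their tags, the concept name olia:Pronoun as used here, and the function `vote` are assumptions made for illustration, not the paper's method.

```python
from collections import Counter

# Hypothetical per-tagset mappings from tag strings to shared concepts.
# Only the stts entry follows Fig. 1; the other tagsets and tags are invented.
TO_SHARED = {
    "stts": {"PDAT": "olia:DemonstrativeDeterminer"},
    "tagset_b": {"DEM": "olia:DemonstrativeDeterminer"},
    "tagset_c": {"PRO": "olia:Pronoun"},
}


def vote(predictions):
    """Majority vote over shared concepts.

    predictions is a list of (tagset, tag) pairs, one per tagger,
    all describing the same token.
    """
    concepts = [TO_SHARED[tagset][tag] for tagset, tag in predictions]
    return Counter(concepts).most_common(1)[0][0]


winner = vote([("stts", "PDAT"), ("tagset_b", "DEM"), ("tagset_c", "PRO")])
```

The alternative raised above, re-training every tagger on uniformly tagged data, would make the mapping table unnecessary but requires that training data to exist.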

P. 5: "Unlike a direct mapping approach, OLiA allows to recover information" -> "allows one to recover"

In the discussion of combining concepts in Section 5, the author says:

"Many tagsets for part-of-speech annotation, for example, introduce hybrid categories to represent either conceptual overlap/fusion or ambiguity using OWL/DL constructs..."

which suggests that there are tag sets in use that are defined by OWL vocabularies. I'm not sure that this is the case, and the example that is given (the Penn Treebank) certainly doesn't use OWL/DL. Is the author trying to say that these vocabularies can be modelled in this way, or that the OWL/DL way provides a more formal way to model this kind of variation? To clarify, the author here is describing the case where a tag in one vocabulary is not the same as that in another vocabulary but is the disjunction of two tags, or the complement, etc. It might be good to see an example of the other approach that the author alludes to with the Penn Treebank.
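
The kind of combination meant here can be sketched with plain sets, treating each concept as a toy set of word forms. The word lists, the set names and the universe are invented for illustration; the comments name the OWL constructs (owl:unionOf, owl:intersectionOf, owl:complementOf) that would play the corresponding roles in an ontology.

```python
# Toy extensions (sets of word forms) for two concepts.
# The members are invented purely for illustration.
DETERMINER = {"the", "this", "all", "some"}
QUANTIFIER = {"all", "some", "many", "few"}
UNIVERSE = DETERMINER | QUANTIFIER | {"walk", "green"}

# Disjunction of two concepts (role of owl:unionOf).
determiner_or_quantifier = DETERMINER | QUANTIFIER

# Overlap of two concepts (role of owl:intersectionOf).
determiner_and_quantifier = DETERMINER & QUANTIFIER

# Everything that is not a determiner (role of owl:complementOf).
not_determiner = UNIVERSE - DETERMINER

# A hybrid category: quantifiers that are not determiners.
quantifier_not_determiner = QUANTIFIER & not_determiner
```

A tag in one vocabulary would then correspond to one of these derived sets rather than to a single concept of the other vocabulary.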

Ill-defined data categories: this is a criticism of ISOcat, but what process or property of OLiA would prevent the creation of a tag that was ambiguous from the point of view of some researcher or theory? So the comment is that this confusion was only found by formally modelling the vocabulary as an OWL ontology. I guess the question is whether this approach is amenable to the kind of person who is developing tag sets; that is, could a linguist make use of these tools to discover issues like this, or do they need to find a tame SW engineer to do it for them?