The Apertium Bilingual Dictionaries on the Web of Data

Tracking #: 1419-2631

Jorge Gracia
Marta Villegas
Asunción Gómez-Pérez
Núria Bel

Responsible editor: Philipp Cimiano

Submission type: Dataset Description

Bilingual electronic dictionaries contain collections of lexical entries in two languages, with explicitly declared translation relations between such entries. Nevertheless, they are typically developed in isolation, in their own formats, and are accessible only through proprietary APIs. In this paper we propose the use of Semantic Web techniques to make translations available on the Web so that other semantically enabled resources can consume them directly, based on standard languages and query means. In particular, we describe the conversion of the Apertium family of bilingual dictionaries and lexicons into RDF (Resource Description Framework) and how their data have been made accessible on the Web as linked data. As a result, all the converted dictionaries (many of them covering under-resourced languages) are interconnected and can be easily traversed from one to another to obtain, for instance, translations between language pairs that are not directly connected in any of the original dictionaries.
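
The traversal idea in the abstract (reaching a language pair that no single dictionary covers by going through a shared intermediate language) can be illustrated with a minimal sketch. It is not the authors' method and does not use the actual dataset: the published data are RDF, whereas this example uses a plain in-memory Python graph, and the word pairs, language codes, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy entries: (source_lang, source_word, target_lang, target_word).
# They stand in for translation relations taken from two separate dictionaries.
TRANSLATIONS = [
    ("en", "bench", "es", "banco"),  # from an assumed EN-ES dictionary
    ("en", "bank", "es", "banco"),   # from an assumed EN-ES dictionary
    ("es", "banco", "ca", "banc"),   # from an assumed ES-CA dictionary
]


def build_graph(translations):
    """Build an undirected graph whose nodes are (language, word) pairs."""
    graph = defaultdict(set)
    for src_lang, src_word, tgt_lang, tgt_word in translations:
        graph[(src_lang, src_word)].add((tgt_lang, tgt_word))
        graph[(tgt_lang, tgt_word)].add((src_lang, src_word))
    return graph


def pivot_translations(graph, lang, word, target_lang):
    """Return direct translations if any exist, else one-pivot candidates."""
    source = (lang, word)
    neighbours = graph.get(source, set())
    direct = {w for (l, w) in neighbours if l == target_lang}
    if direct:
        return direct
    candidates = set()
    for pivot in neighbours:
        for (l, w) in graph.get(pivot, set()):
            if l == target_lang and (l, w) != source:
                candidates.add(w)
    return candidates


graph = build_graph(TRANSLATIONS)
print(pivot_translations(graph, "en", "bench", "ca"))
```

In this toy data both "bench" and "bank" reach the same candidate through the ambiguous pivot "banco", which shows why translations inferred this way are only candidates and need a quality check.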

Solicited Reviews:
Review #1
Anonymous submitted on 24/Jul/2016
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description papers to provide details about the used vocabularies, ideally using the 5-star rating scheme.

Review #2
By Roberto Navigli submitted on 07/Aug/2016
Review Comment:

I had no major comments in the previous review round. Nevertheless, the comments I did make have been taken into account.

Review #3
By Bettina Klimek submitted on 22/Aug/2016
Review Comment:

Disclaimer: This review was written together with Sebastian Hellmann.

In this revised version of the paper, the authors have managed to address all of the critical aspects outlined in the first review. With regard to the minor issues, it can be confirmed that the orthographic mistakes as well as the score of the calculation have been corrected. The minor technical issues raised have also been checked and resolved as far as possible. Furthermore, the authors explain that they will include versioning information for the dataset and update to SPARQL 1.1 in the next release.

The revised paper has also been extended to address the major critical points of the previous review appropriately. Specifically, these issues have been resolved as follows:

*1) Vocabulary use*
The authors added a paragraph in Section 3 which explains their vocabulary use and justifies the modelling choices. The reasons presented in their discussion are scientifically sound and help the reader to understand not only what has been converted to RDF but also why it has been done in this way.

*2) Addition of translation categories*
The authors acknowledge that direct equivalence statements between senses can be established, but they consider this a task that requires more careful analysis and that will benefit from deeper investigation in their future work.

*3) Linkage to BabelNet*
The missing part describing how the links to BabelNet were obtained has been added in Section 4. It also includes a quality evaluation which increases the overall quality of the Apertium RDF datasets.

*4) Quality of indirect translations*
Since the submission of the initial paper, the authors have applied further, richer graph-based techniques to obtain indirect translations. This research has been published separately and is outlined in Section 5. The results indicate the added value of the Apertium RDF dataset over the source data, since the multilingual Apertium RDF data graph can be queried in a simple way.

The revised paper now gives a coherent presentation of the Apertium RDF datasets and succeeds in answering the questions posed. It should also be stressed that this dataset is up to date, constantly maintained, and the object of future investigations, all of which are desirable criteria for dataset publications in general. Finally, third-party use in machine translation is now realized through collaboration with the original Apertium initiative, and other parties that reuse the data are also mentioned, which underlines the quality and added value of the Apertium RDF datasets within the Semantic Web and especially the linguistic linked data landscape.