Semantic Quran: A Multilingual Resource for Natural-Language Processing

Tracking #: 503-1701

Authors: 
Mohamed Sherif
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Guest editors Multilingual LOD 2012 JS

Submission type: 
Dataset Description
Abstract: 
In this paper we describe the Semantic Quran dataset, a multilingual RDF representation of translations of the Quran. The dataset was created by integrating data from two different semi-structured sources and aligning it to an ontology designed to represent multilingual data from sources with a hierarchical structure. The resulting RDF data encompasses 43 different languages, which are among the most under-represented languages in the Linked Data Cloud, including Arabic, Amharic and Amazigh. We designed the dataset to be easily usable in natural-language processing applications, with the goal of facilitating the development of knowledge extraction tools for these languages. In particular, the Semantic Quran is compatible with the Natural-Language Interchange Format and contains explicit morpho-syntactic information on the utilized terms. We present the ontology devised for structuring the data. We also provide the transformation rules implemented in our extraction framework. Finally, we detail the link creation process as well as possible usage scenarios for the Semantic Quran dataset.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Riccardo del Gratta submitted on 23/Jul/2013
Suggestion:
Minor Revision
Review Comment:

The structure of the paper has not changed from the previous version; some of my comments have been addressed, others have not. I think an example of the mapping between the structures chapterIndex/verse/verseText and the LOCATION FORM TAG FEATS would help readers and clarify the methodology.

You decided to use GOLD. Could you justify this choice? Is GOLD more suitable for your purposes than, for example, LexInfo?

Review #2
Anonymous submitted on 30/Jul/2013
Suggestion:
Minor Revision
Review Comment:

This paper presents the Semantic Quran dataset, a dataset that contains information in 43 languages, including Arabic, Amharic and Amazigh.

First, the authors describe the datasets from which the data has been extracted: Tanzil and the Quranic Arabic Corpus. Then, they present the ontology they have designed to provide a description of the “localization” or position of data in the Quran, as well as morpho-syntactic descriptions.

When defining the ontology, they state that it is a “general-purpose linguistic vocabulary”. I would say that this description has a much wider scope than that of the ontology they are proposing, which is explicitly tailored to represent the information of the Quran. In this sense, I would suggest that they reconsider this definition.
Regarding “localization” or provenance information, I realize they have not reused standard provenance vocabularies, such as PROV-O, but

In my previous review I said: “Regarding multilingualism, which is one of the main characteristics of this dataset, the authors have simply relied on the rdfs:label property, and have assigned the corresponding language tag to the label. This seems to be enough because only one translation is provided for each preferred label. Why not use skos:altLabel? Or even prefLabel with the corresponding language tag? What if different alternatives were provided for each label, maybe coming from different resources? Why not represent each label as one lexical item and then use ‘translation or equivalent links’ between them? I would suggest that the authors justify this”. This has not been addressed by the authors in this new submission.

As for the linking phase, it is still not clear to me why they do not take advantage of the morpho-syntactic information contained in the resource. Could they further clarify this in section 5?

Spelling mistakes:
• Arabis -> Arabic (section 2.2)
• … can be improve -> … can be improved (end of section 5)

Review #3
By John McCrae submitted on 12/Dec/2013
Suggestion:
Accept
Review Comment:

This paper presents the publication of a linked data corpus derived from the Quran in 42 languages. The scope and number of languages represented make this a clearly useful resource for NLP. The resource appears to be available now and of high quality.

My comments from the first round of review seem to have been addressed well, although I am not convinced that "GOLD is the most exhaustive ontology for modelling linguistic properties"; this point is, however, very debatable.

Minor:
"witch stands for" (misspelling in several places)
Reference 6: give the author's full first name, not the initial
Reference 8: capitalize SPARQL correctly