Review Comment:
The submitted paper talks about a data set extracted from several language editions of the freely available Wiktionary Wikimedia project. The data is converted via a software framework, made available online under open licenses and hosted as Linked Data.
The data set has good uptime, and everything I tested was technically working and at a very high level. The usage of the lemon vocabulary, where applicable, also seems semantically correct. Still, there are some issues with the paper as well as with the data set, which I will outline in the remaining sections.
The paper should receive a major revision because relevant information is missing from the text. The work as such is quite good, so I am confident that the author will succeed in submitting an acceptable revision of the paper.
# Details
## The title seems to have a spelling mistake: the English spelling is "resource". The double "ss" is the French (and also German) spelling of the word.
## Vocabulary
However, I was not able to access http://kaiko.getalp.org/dbnary#Vocable, and therefore I was not able to look at the lemon extension created by dbnary.
Getting this right is always a nuisance, especially with Linked Data and '#' URIs, as the part after '#' is cut away before the HTTP request is sent. Maybe you can copy some .htaccess rules from here to do static file hosting of the schema: https://github.com/NLP2RDF/persistence.uni-leipzig.org/blob/master/ontol...
Switching to '/' might solve the problem, as well.
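To illustrate the '#' issue, here is a minimal Python sketch using only the standard library. It shows how a client splits the term URI before sending the request; it says nothing about how your server is actually configured.

```python
from urllib.parse import urldefrag

# Term URI as used in the paper.
term_uri = "http://kaiko.getalp.org/dbnary#Vocable"

# A client strips the fragment before issuing the HTTP request.
document_url, fragment = urldefrag(term_uri)

print(document_url)  # the only URL the server ever sees
print(fragment)      # resolved on the client side, inside the returned document
```

So the server has to return the complete schema document at the fragment-free URL, and that document has to describe the term.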
- I noticed, however, that dbnary:Vocable is capitalized in the paper, but in the data it is dbnary:vocable. What exactly is the difference between lemon:LexicalEntry and dbnary:Vocable? Subclassing should imply that there exist resources that are of type lemon:LexicalEntry, but not of type dbnary:Vocable. Is this the case in the dbnary data set? Otherwise, the only distinguishing criterion would be that Vocables were extracted from Wiktionary, but this might not justify an extra OWL class.
- dbnary:Equivalent might be quite misleading, as it is not clear what exactly is equivalent. In translations, the "equivalence" relation does not normally hold between source and target. For the words "cat"@en, "chat"@fr, "Katze"@de, the expected gender does not match in a way I would consider "equivalence", i.e. the complete agreement of all properties (with my Leibniz hat on). Why not simply call it dbnary:Translation?
- The explanation for dbnary:glose is lost on me: "used to dentate the lexical sense of the source of the equivalent"
- dbnary:targetLanguage could link to lexvo.org instead of being a literal
- Figure 2 and 3 are quite confusing as they display a class dbnary:LexicalEnty and I am not so sure whether this should be "Entry" or "Entity" now.
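Regarding the dbnary:targetLanguage point above, here is a minimal sketch of how the literals could be turned into Lexvo IRIs. It assumes that the literals are three-letter ISO 639-3 codes, which I have not verified for all language editions; the function name is hypothetical.

```python
# Lexvo namespace for ISO 639-3 language identifiers.
LEXVO_ISO639_3 = "http://lexvo.org/id/iso639-3/"


def lexvo_language_iri(code: str) -> str:
    """Map a three-letter language code (assumed ISO 639-3) to a Lexvo IRI."""
    code = code.strip().lower()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a three-letter language code: {code!r}")
    return LEXVO_ISO639_3 + code


print(lexvo_language_iri("fra"))
```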
## Usefulness
The usefulness of the data is obvious. The paper would gain a great deal if the use cases were made explicit and really well explained. Please clarify why and for what exactly GETALP and LIG need this data. It remains quite vague in the paper. Could you expand on this and maybe even give a concrete example? This would help motivate the work you did on extracting RDF from Wiktionary. Do you already have any hits on your endpoint or Linked Data interface, or any reported usage of your resource?
## Quality
The data you submitted in the paper is quite raw and I am not really able to judge the quality of your extracted data. One way to improve upon this is to elaborate on what you are using the data for (see section above). It might be possible to deduce that your data quality is sufficient to be useful in certain use cases and NLP methods.
Several other projects and approaches have tried to evaluate their Wiktionary extraction. Some of these statistics would be nice for dbnary as well:
http://code.google.com/p/wikokit/#Statistics
http://svn.aksw.org/papers/2012/JIST_Wiktionary/public.pdf
http://downloads.dbpedia.org/wiktionary/stats_2013_04_06.csv
created with this query: Select ?g ?p count(?p) as ?count where { Graph ?g { ?s ?p ?o } } group by ?p ?g order by desc (?g) desc(?count)
http://www.igi-global.com/chapter/ontowiktionary-constructing-ontology-c... (not openly available online)
You might even have a look here to compare with official stats:
http://meta.wikimedia.org/wiki/Wiktionary
I am writing this because I found Table 3 suboptimal. Ordering alphabetically by ISO code is confusing. The table as such takes up a lot of space but does not really provide any insight. It might be better moved to an HTML page at http://kaiko.getalp.org/about-dbnary/ as extended statistics.
There is a way to query the MediaWiki API for interwiki links. See e.g.
http://en.wiktionary.org/w/api.php
http://en.wiktionary.org/w/api.php?action=parse&page=flight&format=json
"parse":{
"title":"flight",
"revid":20059852,
...
"iwlinks":[
...
{
"prefix":"fr",
"url":"http://fr.wiktionary.org/wiki/vol",
"*":"fr:vol"
},
{
"prefix":"fr",
"url":"http://fr.wiktionary.org/wiki/fuite",
"*":"fr:fuite"
},
...
This might be used to evaluate Table 3 and your extractor for translations (not perfect, of course, but it would help).
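As a rough sketch of how such a check could start, the following Python snippet groups the interwiki links of one parsed page by language prefix. It works on an inline sample that mirrors the response excerpt above rather than calling the API, and the function name is hypothetical.

```python
import json
from collections import defaultdict


def interwiki_targets(api_response: dict) -> dict:
    """Group the interwiki link titles of an action=parse response by prefix."""
    targets = defaultdict(list)
    for link in api_response.get("parse", {}).get("iwlinks", []):
        # The "*" field holds "prefix:title"; keep only the title part.
        label = link["*"]
        title = label.partition(":")[2] or label
        targets[link["prefix"]].append(title)
    return dict(targets)


# Sample data copied from the excerpt above (elided parts left out).
sample = json.loads("""
{"parse": {"title": "flight",
           "iwlinks": [
             {"prefix": "fr", "url": "http://fr.wiktionary.org/wiki/vol", "*": "fr:vol"},
             {"prefix": "fr", "url": "http://fr.wiktionary.org/wiki/fuite", "*": "fr:fuite"}
           ]}}
""")

print(interwiki_targets(sample))
```

The resulting titles per language could then be compared with the translations that your extractor produces for the same page.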
I am fully aware that evaluation is quite difficult. I don't expect that you will implement everything I wrote above. But I would definitely require that the second submission contain a better evaluation of quality in some form.
## Namespaces
Namespaces are not resolved in the paper. They should be resolved at least once so readers know what dbnary stands for. Daily votes at http://prefix.cc/dbnary also help establish this namespace.
## Interlinking and related work
There is also a similar approach as a subproject of DBpedia, called Wiktionary2RDF.
Sebastian Hellmann, Jonas Brekle, Sören Auer: Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. JIST 2012, http://svn.aksw.org/papers/2012/JIST_Wiktionary/public.pdf
Note that I am not trying to coerce a citation (http://en.wikipedia.org/wiki/Coercive_citation). Your paper is about your data set, and the criteria are quality, usefulness and completeness of description, not so much the comparison with other approaches. However, mutual links between the data sets would be useful. There might even be a way to merge or fuse both approaches in the future, although I am not yet sure how.
We are also double-typing some resources as LexicalEntry and LexicalSense, so schematically, the dbnary-lemon extension would be applicable for this project as well.
## Interlinking part 2
Is the data set interlinked to anything?
ask from {?s owl:sameAs ?o}
on http://kaiko.getalp.org/sparql returns false
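For reproducibility, here is a minimal sketch of how this check could be scripted. It spells out the owl prefix and uses a plain ASK, builds the request URL following the SPARQL protocol, and parses a JSON boolean result. The response body in the last lines is a hand-written sample, not output fetched from your endpoint, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://kaiko.getalp.org/sparql"

# ASK query with the owl prefix declared explicitly.
ASK_SAMEAS = (
    "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
    "ASK { ?s owl:sameAs ?o }"
)


def ask_url(endpoint: str, query: str) -> str:
    """Build a SPARQL protocol GET URL carrying the query as a parameter."""
    return endpoint + "?" + urlencode({"query": query})


def parse_ask(body: str) -> bool:
    """Read the 'boolean' member of a SPARQL JSON result for an ASK query."""
    return bool(json.loads(body)["boolean"])


url = ask_url(ENDPOINT, ASK_SAMEAS)
print(url)

# Hand-written sample of a negative answer in SPARQL JSON results format.
print(parse_ask('{"head": {}, "boolean": false}'))
```

The URL would be requested with the header "Accept: application/sparql-results+json" to obtain the JSON form of the answer.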
## Technical details about the data:
### Seems like http://datahub.io/dataset/dbnary/resource/2002de88-2f86-48c6-a24c-f70d0e... was uploaded to CKAN. Although technically available for everyone, this feature was created for people without hosting capabilities (e.g. researchers from the humanities) and it should be used with care.
### Linked Data is working for URIs. IRIs do not seem to be supported, compare:
curl -L -H "Accept: text/rdf+n3" "http://de.dbpedia.org/resource/Rüdiger"
with
curl -L -H "Accept: text/rdf+n3" "http://kaiko.getalp.org/dbnary/fra/thésaurus"
There has been a proposal for Transparent Content Negotiation rules here:
Internationalization of Linked Data. The case of the Greek DBpedia edition, Dimitris Kontokostas, Charalampos Bratsas, Sören Auer, Sebastian Hellmann, Ioannis Antoniou, George Metakides, Journal of Web Semantics: http://www.websemanticsjournal.org/index.php/ps/article/view/319
However, it is difficult to set up if the tool does not support it out of the box. Virtuoso can be configured to do so, but it is not an easy task.
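To make the URI versus IRI distinction concrete, here is a small Python sketch that converts the IRI from the curl example into its percent-encoded URI form. It only illustrates the two spellings; which of them your server accepts is exactly what I could not confirm.

```python
from urllib.parse import quote, unquote

# IRI from the curl example above, containing a non-ASCII character.
iri = "http://kaiko.getalp.org/dbnary/fra/thésaurus"

# Percent-encode the UTF-8 bytes of non-ASCII characters, keeping ':' and '/'.
uri = quote(iri, safe=":/")
print(uri)

# Decoding restores the original IRI, so both forms name the same resource.
print(unquote(uri) == iri)
```

Ideally, a request for either form would lead to the same description of the resource.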
### The graph name in Virtuoso contains a trailing '/' (http://kaiko.getalp.org/dbnary/). I am not sure what the best practice is here. Personally, I prefer no '/', but I am not insisting, rather asking.
23613592
## Minor
### The article contains a lot of capitalization issues:
- Wikipedia as well as Wiktionary should always be capitalized
- Multilingual Information Retrieval -> multilingual Information Retrieval
- extract Multilingual Lexical Data -> multilingual lexical data
- Why is Wikimedia italic?
- (Study group -> (study group
### Please check. IIRC, footnotes go after punctuation:
language edition^2. -> language edition.^2
Comments
Resubmission uploaded
A resubmission of this article has been uploaded with tracking number 504:
http://www.semantic-web-journal.net/content/dbnary-wiktionary-lemon-base...