LIdioms: A Multilingual Linked Idioms Data Set

In this paper, we describe the LIDIOMS data set, a multilingual RDF representation of idioms containing five languages. The data set is intended to support natural-language processing applications by providing links between idioms across languages. The underlying data was crawled and integrated from various sources. To ensure the quality of the crawled data, all idioms were evaluated by at least two native speakers. Herein, we present the model devised for structuring the data. We also provide the transformation rules implemented in our extraction framework. The resulting data set relies on best practices in accordance with the Linguistic Linked Open Data community. We also detail the link creation process as well as possible usage scenarios for the linked idioms data set.
In this revised version of the paper, the authors took into account most of the reviewers' comments. The copyright situation has been clarified; the methods used to collect and interlink data are described in more detail; and the model used to represent information is clearer.
These modifications bring significant improvements, and what has been done and why is now clear for the reasonably well-prepared reader. However, there are two main aspects which, in my opinion, still need to be improved.
The first concerns some remaining inaccuracies and unclear points about, among other things, the building process and interlinking.
The second and major one concerns the overall quality of the text which, in my opinion, needs a great deal of work. There are minor English mistakes but also, and this is much more critical, incorrect sentences, inaccuracies and inconsistencies. This is unfortunate, because I believe the resource presented in this paper, although small, is valuable and lays the groundwork for a better handling of idioms on the LLOD.
The following subsections detail these two points. Regarding section 2, on text editing, the comments may not be exhaustive, and they stop after section 5.
This paper presents the construction of a dataset of idioms that is available across multiple languages. This dataset is quite small and I feel the use cases for this dataset could be better motivated. The paper describes four `use cases', which are really little more than SPARQL queries, and only one of which (translation) is clearly useful for other tasks. Moreover, the authors claim that their dataset is useful because existing resources such as BabelNet do not identify multi-word expressions, but this is quite trivial to do (MWEs are the entries with more than one word!). The authors should better motivate why idioms (non-compositional MWEs) are of such interest. Moreover, the authors fail to give a good definition of idioms or to define the criteria that they gave to the annotators. However, other works (e.g., have developed complex guidelines for this task. This is particularly troubling as the authors claim in one example that `by the book' is not an idiom, yet it is a very syntactically inflexible MWE, and this native speaker would reject something like `we did it by his book'. Nevertheless, the manual effort that has gone into creating and linking idioms makes this a valuable resource that is of use to researchers investigating figurative language processing.
As far as I can see, there is no third-party use, and none is described in the paper. The dataset is also quite small, and it is not clearly stated why it would be of more interest to a linguist than existing larger resources such as WordNet or BabelNet.

In this resubmitted version of the paper, the authors have taken into account most of the recommendations made by the reviewers, but there are some aspects that still need to be reconsidered.
Following the evaluation dimensions suggested by the journal, the paper has been evaluated with respect to: quality and stability of the dataset - evidence must be provided; clarity and completeness of the descriptions; usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided.

Quality and stability of the dataset - evidence must be provided

The code is available on GitHub, but the SPARQL endpoint was not working.
As for the data collection process, the authors had to manually review the idioms retrieved automatically by a custom web crawler in order to discard those multiword expressions that could not be considered idioms. Then, native speakers and linguists checked the corresponding set of idioms in their respective languages. After this evaluation step, the number of automatically retrieved idioms was reduced by approximately half, which considerably reduces the impact of the dataset. (It is worth noting that, because of potential law infringement issues raised by one of the reviewers, the authors had previously had to remove the idioms taken from Cambridge and Wiktionary.)

Clarity and completeness of the descriptions

As for the Related Work section, the authors adequately refer to datasets that represent linguistic and terminological information in Linked Data. However, when identifying the gap that this resource is going to fill (namely, no specific resource in the LOD cloud about idioms), they refer to lexinfo, ontolex, etc., which are models intended to represent linguistic information in RDF. So, in my opinion, two levels are being mixed up here: the modeling level and the instantiation level. This should be clarified.
Section 4 is devoted to the representation model, namely ontolex and its vartrans module for representing term variants and translations. Some modeling decisions need further clarification. For instance, what is the benefit of including lexical concepts? How is vartrans:category defined? Why did the authors remove the definitions of the idioms that were in the original language (after translating them into English)? Why not use the usage property of the LexicalSense class to restrict the meaning of an idiom to a geographical area?
Regarding this last question, the authors mention that they provide an example of (and I quote) “a translation of two idioms from Portuguese to English”, but I only see one idiom and its equivalent in English. In fact, it would be interesting to see how they have solved the issue of an idiom in the same language with two different senses in different geographical areas.

Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided

In section 6, the authors describe both the internal and the external linking processes. The internal linking process was performed manually. The external one was performed with LIMES in the case of DBnary, and with the BabelNet API or manually in the case of BabelNet.
The language in section 6.3 needs to be reviewed, especially the last paragraph in that section. It is not clear why precision was poorer in the case of BabelNet, or whether this could easily be remedied.
Then, section 7 deals with “application scenarios for the dataset”. Three use case scenarios are depicted, but these do not seem to be real uses of the data by third-party users, which may limit the interest of the work presented here.