A Domain Categorisation of Vocabularies Based on a Deep Learning Approach

Tracking #: 2073-3286

Authors: 
Alberto Nogales
Alvaro García Tejedor
Miguel Angel Sicilia

Responsible editor: 
Freddy Lecue

Submission type: 
Full Paper
Abstract: 
The publication of large amounts of open data has become a major trend nowadays. This is a consequence of initiatives like the Linked Open Data (LOD) community, which publishes and integrates datasets using techniques like Linked Data. Linked Data publishers should follow a set of principles for dataset design. These principles are set out in a 2011 document that describes tasks such as the consideration of reusing vocabularies. With regard to the latter, another project called Linked Open Vocabularies (LOV) attempts to compile the vocabularies used in LOD. These vocabularies have been classified by domain following the subjective criteria of LOV members, which carries the inherent risk of introducing personal biases. In this paper, we present an automatic classifier of vocabularies based on the main categories of the well-known knowledge source Wikipedia. For this purpose, word-embedding models were used in combination with Deep Learning techniques. Results show that with a hybrid model of regular Deep Neural Network (DNN), Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN), vocabularies could be classified with an accuracy of 93.57 per cent. Specifically, 36.25 per cent of the vocabularies belong to the Culture category.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 03/May/2019
Suggestion:
Major Revision
Review Comment:

The paper presents a classifier of vocabularies in LOV based on the main categories of Wikipedia, using word-embedding models and Deep Learning techniques. The authors show that a hybrid model (DNN, RNN, CNN) gives an accuracy of 93.57%. The idea is very interesting and promising, but the authors need to (1) better explain some choices in the methodology, (2) compare their classification with the LOV categories, and (3) adhere to FAIR principles by providing a link to the materials of their experiments.

See below for more detailed comments about the paper:

== Originality ==
The paper seems to be original, as it uses Deep Learning techniques to attempt to classify vocabularies. The use of the Random Multimodel Deep Learning (RMDL) seems promising for the classification task. However, some aspects of the methodology need to be better explained when applied to vocabularies, which are a special type of data, as they encode "semantics" for datasets.
Data vocabularies: The dump of the vocabularies that was used is almost 5 months old, and 72 vocabularies were discarded. Could you elaborate more on this? Normally, each vocabulary contains many annotations (not only classes and properties). What about rdfs:comment, dcterms:abstract|title or any other annotation contained in the vocabulary? Additionally, LOV is a multilingual dataset with labels in English, French, Spanish, etc. It is not clear to me how you take the multilingual aspect into account in your methodology (pre-processing of the data vocabularies and the Wikipedia corpus). The SPARQL queries on page 5 show only abstracts in English. Could you give more details? (See the first sketch after the Embeddings point below.)
Choice of the categories: Some categories are difficult to apply in LOV, either because they are too generic or because they are hard to consider as a domain category (e.g., "Philosophy" and "Technology").
Embeddings: Vocabularies are also knowledge graphs. Using only word embeddings may fail to capture the underlying nature of the graph. Have you checked whether graph embedding approaches (RDF2Vec, Graph2Vec, Node2Vec) could be suitable for your task? (See the second sketch below.)
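For the annotation point above, a minimal sketch (purely illustrative, with a placeholder file name and property list, not the authors' actual pipeline) of how additional annotations could be harvested from a vocabulary dump with an explicit English-language filter:

    # Illustrative only: extract textual annotations from a vocabulary dump
    # with rdflib, restricted to English literals. "vocab.ttl" is a placeholder.
    from rdflib import Graph

    g = Graph()
    g.parse("vocab.ttl", format="turtle")

    query = """
    PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?term ?text WHERE {
        ?term ?annotation ?text .
        VALUES ?annotation { rdfs:label rdfs:comment dcterms:title dcterms:description }
        FILTER(isLiteral(?text) && langMatches(lang(?text), "en"))
    }
    """

    for term, text in g.query(query):
        print(term, text)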
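And, for the embedding point, a hypothetical sketch of how a vocabulary could instead be embedded as a graph, assuming the node2vec package and rdflib (the file name is again a placeholder; RDF2Vec or pyRDF2Vec would be the more RDF-native alternatives):

    # Hypothetical sketch: treat the vocabulary as a graph and learn node
    # embeddings with node2vec, as an alternative to purely textual embeddings.
    import networkx as nx
    from rdflib import Graph
    from node2vec import Node2Vec

    rdf = Graph()
    rdf.parse("vocab.ttl", format="turtle")   # placeholder file name

    # Build an undirected networkx graph from the RDF triples.
    nx_graph = nx.Graph()
    for s, p, o in rdf:
        nx_graph.add_edge(str(s), str(o), predicate=str(p))

    # Random walks plus skip-gram over the graph structure.
    n2v = Node2Vec(nx_graph, dimensions=64, walk_length=20, num_walks=100, workers=2)
    model = n2v.fit(window=5, min_count=1)

    # model.wv now maps each IRI (or literal) to a vector usable as classifier input.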

== Significance of the results ==
The authors acknowledge in section 4.2 some limitations of their approach, and mention the use of “very general domains for the classification”.
The evaluation of the model in section 3.2.3 lacks recall and F1-measure (a sketch of how these could be reported follows below). I also miss an evaluation of the approach against the manually created tags in LOV. How significant is your classification compared to the classification made by the LOV curators? Currently, there are 28 vocabularies tagged "People" in LOV (https://lov.linkeddata.es/dataset/lov/vocabs?tag=People), and you classified 38. Which are those 38 vocabularies?
The authors should make this qualitative assessment for better acceptance of the proposed methodology.
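As an illustration, per-class recall, F1 and a confusion matrix could be reported with scikit-learn along these lines (y_true and y_pred stand in for the gold LOV tags and the model's predicted categories):

    from sklearn.metrics import classification_report, confusion_matrix

    # Placeholder labels standing in for the gold tags and the model's predictions.
    y_true = ["People", "Culture", "People", "Science"]
    y_pred = ["People", "Culture", "Culture", "Science"]

    print(classification_report(y_true, y_pred))   # per-class precision, recall and F1
    print(confusion_matrix(y_true, y_pred, labels=["Culture", "People", "Science"]))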

== Quality of writing ==
The paper is well organized; it is clear what the goal of the paper was and how the experiment was conducted. What is missing is a link to a notebook to assess the pipeline, and the list of vocabularies classified under each category (FAIR principles in action).
Figure 2 is not easy to read; please update it.
What does "vocabularies used are ontologies" mean?

Review #2
Anonymous submitted on 07/Oct/2019
Suggestion:
Major Revision
Review Comment:

The paper introduces an approach to automatically classify documents using Deep Learning. The classification algorithm is applied to vocabularies listed on the Linked Open Vocabularies (LoV) portal. The approach seems to be successful, with an accuracy of 93.57%, but I would recommend that the authors clarify why an automatic classifier would be needed in the first place. The main challenge seems to be the bias in manual classification; however, one can argue that applying Deep Learning trained on a corpus of manually curated data (Wikipedia) also introduces a similar bias. Furthermore, the LoV portal offers users the ability to tag their own vocabularies. Would the vision then be to use the algorithm to suggest tags for a given vocabulary? This is not clear.

Despite being interesting, the approach depicted in Figure 1 is not particularly novel. It is surprising that, when using DBpedia articles, the authors still decide to go back to Wikipedia to extract categories instead of using the DBpedia ones. Also, it appears that the terms extracted from the vocabularies are turned into plain English words to then be used with word embeddings. This generates the kind of issue referred to in the conclusion, where "bass" could designate anything. Using IRIs instead of words would alleviate that risk by relying on unambiguous identifiers instead of ambiguous words. Furthermore, it is also unclear whether a dump of the LoV database or a scraped version of the site is used, as both are indicated in the paper. Lastly, the evaluation section should be revised to give more information. In particular:
* A definition of "accuracy" should be provided along with an indication of the relevant gold standard.
* Table 4 should be replaced by a confusion matrix listing how vocabularies were classified versus the expected target class. Such a matrix should also come with a study of the errors and some comment on their potential source, as well as the impact of such errors.
* The SVM, Bayes, SGD and proposed approaches should all be featured in the same chart/grid for easier comparison. To this end, Table 3 could be revised into a comparison of misclassifications per model. This would make it easier to choose one or the other depending on their respective strengths.
* Figure 2 needs to be replaced by something easier to consume (pie charts with similarly sized slices are hard to make sense of).
* It is unclear why the SPARQL queries are limited to 10000 results, and the offset of 0 seems superfluous. Were those queries paginated in order to reach the announced 50000 results? This should be better motivated and explained (see the pagination sketch after this list).
* Processing time should be indicated. Deep learning models can sometimes be very costly to train.
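Regarding the pagination point, the usual LIMIT/OFFSET loop looks roughly as follows (a sketch only; the query below is a simplified stand-in for the paper's actual queries against DBpedia):

    # Sketch of LIMIT/OFFSET pagination against a SPARQL endpoint (here DBpedia),
    # showing how 50000 abstracts could be collected in pages of 10000.
    from SPARQLWrapper import SPARQLWrapper, JSON

    PAGE = 10000
    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
    endpoint.setReturnFormat(JSON)

    abstracts, offset = [], 0
    while True:
        endpoint.setQuery(f"""
            PREFIX dbo: <http://dbpedia.org/ontology/>
            SELECT ?abstract WHERE {{
                ?article dbo:abstract ?abstract .
                FILTER(langMatches(lang(?abstract), "en"))
            }}
            LIMIT {PAGE} OFFSET {offset}
        """)
        rows = endpoint.query().convert()["results"]["bindings"]
        if not rows:          # stop when a page comes back empty
            break
        abstracts.extend(r["abstract"]["value"] for r in rows)
        offset += PAGE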

Misc comments and typos:
* Table 3 has duplicated columns
* All the captions of all tables and figures should be extended to highlight the most important information about the table/figure
* Equation (1) is blurry and should be replaced by a LaTeX rendering instead of an image
* Figure 2 key is unreadable
* There are no pointers to a source code repository or dataset, which makes the results harder to reproduce and compare
* Table 1 should be redesigned to have several columns per metric and one row per topic
* There is a typo in the prefix declaration of dbc

Review #3
Anonymous submitted on 06/Mar/2020
Suggestion:
Reject
Review Comment:

This paper presents an approach for classifying vocabularies based on the main categories of the well-known knowledge source Wikipedia. Word-embedding models are used to encode the vocabularies, which are then fed into hybrid artificial neural networks, mainly RNNs and CNNs, for classification purposes.

The related work section is interesting but is missing the pros and cons of the various approaches. The authors list papers related to the presented work but fail to motivate the need for a novel approach. The approach seems more motivated by the desire to experiment with artificial neural network architectures. Such a related work section might be acceptable for a conference paper, but a journal paper should expose more details on the missing elements of the various presented approaches. This would help readers understand the non-technical motivations.

Regarding: "It appears that no articles currently exist that assess the use of Deep Learning techniques for categorising ontologies by domain" - I do not think that is an argument for driving research. The question should be rather "what is missing in current approaches? what are the obvious shortcomings? and then motivate why artificial neural network could bring the field further".

Regarding the artificial neural network architecture: more motivation should be provided. I would recommend testing transformer architectures with self-attention and comparing them with your approach; I suspect a self-attention architecture might bring an extra gain.
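As a concrete, purely illustrative starting point, such a self-attention baseline could be as simple as a pretrained DistilBERT classifier over the vocabulary descriptions (the number of labels and the example text below are placeholders, not values from the paper):

    # Hypothetical self-attention baseline: DistilBERT sequence classifier
    # over vocabulary descriptions, to be fine-tuned on the labelled data.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=18)   # placeholder: one output per category

    texts = ["An ontology describing musical works, performers and instruments."]
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        logits = model(**batch).logits              # shape: (batch, num_labels)
    predicted = logits.argmax(dim=-1)               # index of the predicted category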

Regarding the training/validation: more details should be provided on the architectures (number of units, layers, learning rate, dropout, any other regularisation); it is not even clear whether the model suffers from overfitting, as not enough details are provided. Overall, the machine learning section needs to be better documented, with the set-up and results confirming that the reported results are not biased by an overfitted model.
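To make this concrete, the kind of detail I would expect could be conveyed with a short sketch like the following (layer sizes, dropout rate and the Conv1D+LSTM structure are my assumptions, not the authors' reported configuration):

    # Illustrative Keras sketch of the details a reader needs: layer sizes,
    # dropout, learning rate, and a held-out validation split with early
    # stopping to detect overfitting. All values are placeholders.
    from tensorflow import keras
    from tensorflow.keras import layers

    NUM_CLASSES, VOCAB_SIZE, SEQ_LEN = 18, 20000, 200

    model = keras.Sequential([
        keras.Input(shape=(SEQ_LEN,)),
        layers.Embedding(VOCAB_SIZE, 128),
        layers.Conv1D(64, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),
        layers.Dropout(0.5),                           # regularisation
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.summary()                                    # documents units and layers

    # history = model.fit(x_train, y_train, validation_split=0.2, epochs=30,
    #                     callbacks=[keras.callbacks.EarlyStopping(
    #                         patience=3, restore_best_weights=True)])
    # Comparing history.history["accuracy"] and "val_accuracy" reveals overfitting.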

Overall, the results are interesting, but the methodology is not clearly defined and is hardly replicable. In addition, it is not clear whether the model's performance is due to overfitting, as no details of the training/validation are provided. The paper is clearly an application of machine learning to solve a classification task on Semantic Web data. Unfortunately, I am not sure there is clear novelty.