Data-driven Modelization and Knowledge Graph Generation within the Tourism Domain

Tracking #: 3155-4369

Authors: 
Diego Reforgiato Recupero
Luca Secchi
Alessandro Chessa
Gianni Fenu
Francesco Osborne
Angelo Salatino
Enrico Motta

Responsible editor: 
Katja Hose

Submission type: 
Full Paper
Abstract: 
The tourism and hospitality sectors have become increasingly important in the last few years, and the companies operating in this field are constantly challenged to provide new, innovative services. At the same time, (big) data has become the “new oil” of this century, and Knowledge Graphs are emerging as the most natural way to collect, refine, and structure this heterogeneous information. In this paper, we present a methodology for semi-automatically generating a Tourism Knowledge Graph (TKG), which can be used for supporting a variety of intelligent services in this space, and a novel ontology for modelling this domain, the Tourism Analytics Ontology (TAO). Our approach processes and integrates data from Booking.com, Airbnb, DBpedia, and GeoNames. Thanks to its modular structure, it can be easily extended to include new data sources or to apply new enrichment and refinement functions. We report a comprehensive evaluation of the functional, logical, and structural dimensions of TKG and TAO.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 29/Jul/2022
Suggestion:
Accept
Review Comment:

This paper presents a methodology and an ontology for the tourism domain. It is easy to read and has a high quality of writing.

Concerning originality, I consider that the paper is very appropriate for this journal: the methodology is the most interesting part to me, but the domain considered is also relevant, and the paper provides a good review of previous approaches and makes relevant contributions.

Therefore, I consider that the results are significant and valuable for the community.

The data files are mostly on GitHub, which in my opinion can serve as a "long-term stable URL" for resources. The data files are well organized and well documented, which is a guarantee of their reproducibility.

Although I want to congratulate the authors on this good paper, despite its 37-page length I only have a few minimal comments:

- The number of inference verifications (10) and "error provocations" (8) seems low. In my opinion, these numbers should be higher. How much higher? I would be more confident if they were doubled.
- The ontology should be verified with OOPS! (https://oops.linkeddata.es/). This tool is a reference for ontology validation.

Review #2
By Eva Blomqvist submitted on 04/Sep/2022
Suggestion:
Minor Revision
Review Comment:

The paper presents the method and process for creating a tourism ontology, including code and tools for generating parts of the ontology and the related knowledge graph. The paper is well written and detailed. It is also quite long, and in some sections feels a bit repetitive. However, overall it is a good paper that could be quite valuable to many practical ontology- and KG-engineering projects.

Regarding novelty and originality, the paper is not extremely novel, in that it describes quite a typical task and a typical outcome, i.e. yet another tourism ontology and associated KG. However, it is the level of detail, the amount of reusable resources, and the discussion of the detailed issues, solutions, and lessons learned that make it valuable to other researchers and practitioners. Also, the fact that some parts of the ontology are generated automatically by a script is somewhat novel in itself. Regarding significance, the scientific value of the paper is also not extremely high, but in my opinion that is outweighed by its practical value. It really shows, from start to finish, how a large-scale ontology and KG generation project can be set up, what the issues are, and how the results can be evaluated. The quality of writing is also quite good: the paper is mostly clear, does not contain many language issues, and has few unclear or unsubstantiated claims.

Regarding the associated data and code, it seems to be available and it is documented with read-me files that explain how to run the code. I have not tested it, but it looks like it could be replicable without too much effort.

Based on these overall remarks, I suggest that the paper would be accepted with just some minor revisions. Below, please find some more detailed comments and questions, which should be addressed in such a revision:

The only comment I have that could warrant a larger revision is that some paper sections feel almost overly detailed, and a bit repetitive. It may sound inconsistent, since the level of detail is on the other hand one of the merits of the paper. However, I think this is more a matter of improving the reading experience than of actually leaving something out. Perhaps some parts could be lifted out into an appendix, if that is allowed by the editor? Or collected in a larger table? It is in particular the subsections of section 3.4, which often consist mostly of bullet lists, and the triple structures explained in the subsections of 3.5, that I feel become a bit hard to read. In this part of the paper I would rather have preferred just a “running example” throughout the text, with the details and exhaustive listings of all the options and categories in some appendix or table. However, I realise this might be hard to achieve. It would also be beneficial to separate the “general” methodology a bit more from the example of TAO: which parts are generic and can be used for other domains, and which are TAO-specific?

Further, I have an issue with the paper title. The term “modelization” is used in the title, but is not used elsewhere in the paper. And although it seems to be a correct term, it is perhaps not the most common one to use when talking about ontology modelling. In fact, when first reading the paper I had a hard time understanding the connection to the title, and had to look up the term. Perhaps it would be better to use a more common term in the title, and at least be consistent with the rest of the paper.

The related work section starts out well, but from line 39 onwards there is no longer any proper comparison of the described approaches to the one presented in the paper. The last few approaches described there also need to be compared to the current work.

The methodology that is nicely illustrated in Figure 1: is it the authors’ own invention, or is it based on some existing methodology? Citing the methodological basis of the described methodology would be suitable here.

It is not entirely clear how the use cases in section 3.1 are connected to the requirements (e.g. CQs) in section 3.3.1. For instance, use case 1, about topics of interest in reviews, does not seem to have any corresponding CQ in section 3.3.1. Overall, topics of interest appear several times in the use cases, but not in the requirements at all. Some CQs are also formulated as “examples”, i.e. with what I interpret as concrete instances mentioned, such as “Wi-Fi” and a concrete distance of 2 km, while others are formulated in a much more generic way. Is there a reason for this difference, or can they be formulated in a more uniform manner? See the sketch below for what I mean.
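To illustrate what a uniform formulation could look like: a CQ bound to concrete values ("Wi-Fi", 2 km) could be kept generic, with the concrete values supplied only as bindings at query time. A minimal sketch in Python with rdflib; all class and property names are invented for illustration and are not taken from TAO:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.plugins.sparql import prepareQuery

# Hypothetical namespace and terms; the real TAO vocabulary may differ.
TAO = Namespace("http://example.org/tao#")

# Generic CQ: "Which facilities offer a given amenity within a given
# distance of the centre?" (distance simplified to a single property)
cq = prepareQuery(
    """
    SELECT ?facility WHERE {
        ?facility a tao:LodgingFacility ;
                  tao:offersAmenity ?amenity ;
                  tao:distanceFromCentre ?distance .
        FILTER (?distance <= ?maxDistance)
    }
    """,
    initNs={"tao": TAO},
)

g = Graph()  # in practice, loaded from the TKG
# The concrete values become bindings instead of being baked into the CQ.
for row in g.query(cq, initBindings={"amenity": TAO.WiFi,
                                     "maxDistance": Literal(2.0)}):
    print(row.facility)
```

This way all CQs share one uniform, generic formulation, and the "example" values appear only as test bindings.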

It is a bit unclear if and how the Hontology is used. It is mentioned together with other reused ontologies, and described in detail on page 11, but then does not seem to be reused in the same way.

On page 10, the STI Accommodation ontology is mentioned on line 25, then the Accommodation ontology is mentioned on line 32 - do you refer to the same ontology here? On the same page, the authors state “We reused…” on line 46 - what does reuse actually imply here? Import? Mentioning the URIs? Something else? The same applies on page 11, line 22, where the authors state that “We reused… by importing and extending a few classes…” - how did you do that? Import generally brings in the whole ontology, not just a few classes, so how did you reuse just a few classes? A sketch of the alternative I have in mind follows below.
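To make the question concrete, the usual alternative to owl:imports is "minimal reuse": referring to, and optionally redeclaring, only the external terms that are needed, by their URIs. A rough sketch in Python with rdflib; the namespaces are illustrative, not necessarily the ones the authors used:

```python
from rdflib import Graph, Namespace, OWL, RDF, RDFS

ACCO = Namespace("http://purl.org/acco/ns#")  # illustrative external namespace
TAO = Namespace("http://example.org/tao#")    # hypothetical TAO namespace

g = Graph()
g.bind("acco", ACCO)
g.bind("tao", TAO)

# Minimal reuse: redeclare only the single external class we need and
# link to it, instead of owl:imports-ing the whole source ontology.
g.add((ACCO.Accommodation, RDF.type, OWL.Class))
g.add((TAO.LodgingFacility, RDF.type, OWL.Class))
g.add((TAO.LodgingFacility, RDFS.subClassOf, ACCO.Accommodation))
```

Stating explicitly which mechanism was used (full import, partial redeclaration, or just mentioning URIs) would answer the question.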

On page 11, in the paragraph about DBpedia, both the term subsumption hierarchy and the term taxonomy are used - is there a difference, or are they synonyms? If they are synonyms, why use two different terms?

At the beginning of section 3.3.3, the first aim states that the ontology should be “compatible with all the requirements”. What does it mean to be compatible with the requirements? Shouldn’t all the requirements be fulfilled?

Overall, in section 3 it is not so clear what the actual scope of the ontology is. It was not until I got to the list in 3.3.3 that I understood that the aim is only to model lodging, not any other kinds of facilities or events. This could be made clearer from the start.

Figure 2 is illustrative and gives a nice overview, although it is a bit too small to be readable in all its parts, at least when printing the paper. Also, it is not entirely clear how to interpret the arrows - do they refer to domain and range of properties, or something else?

Lines 31-32 on page 13 have two problems. First, they exemplify a more general problem of paragraph division in the whole paper: usually a paragraph should consist of more than one sentence, but the paper has many paragraphs of just one sentence, which should be fixed. The other issue is with the term “entity linking” - I don’t understand how entity linking could be used here. Or do you mean entity recognition?

I understand that the authors want to be comprehensive, but I am wondering if it is wise to include such detailed taxonomies in the ontology, as described on page 14, lines 36-43. In my experience, this is usually where ontologies go wrong: over-specifying bits that could easily be added later through reuse makes the ontology much less reusable, since there will always be some little bit of the taxonomy to disagree with.

Page 15, line 41, do you mean disjointness?

Page 16, lines 4-6 - this is VERY specific. I am sure that lots of users of the ontology would rather set these limits themselves. This seems like overspecification.

Section 3.3.4 discusses the way the ontology is generated from code. An example would be suitable here, since this is a novel way to manage the process; a sketch of the kind of pattern I mean follows below. Additionally, I would expect some discussion and comparison with similar approaches. For instance, I wonder why OTTR templates were not considered for this task, since they seem to serve a similar function? This should be discussed.
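To illustrate the kind of example I would expect: the generation step presumably expands some template-like pattern over data-driven lists of terms. A hypothetical sketch in Python with rdflib (the names and the pattern are mine, not the authors'):

```python
from rdflib import Graph, Literal, Namespace, OWL, RDF, RDFS

TAO = Namespace("http://example.org/tao#")  # hypothetical namespace

def add_amenity_class(g: Graph, label: str, parent) -> None:
    """Expand one 'template' instance: a labelled subclass of a parent."""
    cls = TAO[label.title().replace(" ", "")]  # "free parking" -> tao:FreeParking
    g.add((cls, RDF.type, OWL.Class))
    g.add((cls, RDFS.subClassOf, parent))
    g.add((cls, RDFS.label, Literal(label, lang="en")))

g = Graph()
g.add((TAO.Amenity, RDF.type, OWL.Class))
# Data-driven part: in the paper's setting, this list would come from
# the amenity values observed in the source data.
for label in ["free parking", "swimming pool", "airport shuttle"]:
    add_amenity_class(g, label, TAO.Amenity)

print(g.serialize(format="turtle"))
```

OTTR templates capture exactly this kind of parameterised pattern declaratively, with one template instance per data row, which is why a comparison seems warranted.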

It is not entirely clear what the numbers in Figure 4 are supposed to mean. Do they give an order of the steps?

Sections 3.4 and 3.5 would benefit from a running example that can be followed throughout the sections.

Section 3.4.6: Why are the types of entities only restricted in the filtering step, rather than from the beginning, when spotting and selecting candidates, etc.?

How scalable is the data transformation process, in terms of time, space, etc.?

It is a bit unclear in section 3.6.1, whether this is just a description of the process, or if this is data that is actually generated and stored, i.e. does Figure 7 only describe a workflow, or does it describe the data collected and stored about the workflow?

Regarding the evaluation section, it is not very clear from the beginning whether you aim for mere validation of the solution, e.g. that it fulfils its requirements, or an evaluation, i.e. studying “how good is it” with respect to some quality aspects.

A more detailed discussion of what test data was used should be included in section 4. Was the test data different from the data used to develop the ontology? Otherwise, the ontology could be biased towards that specific set of data.

Top of page 33: I am not sure I agree that you can draw all those conclusions simply based on a few measures. For instance, cognitive ergonomics seems to be a complex aspect, yet only three simple metrics are used to assess it. Here I would have expected an additional user study or similar, to be able to really state that it has good cognitive ergonomics. This should not be interpreted as me thinking the evaluation is bad; in fact, it is much better than what I usually see in most papers, but the authors sometimes draw a bit too big conclusions from it. The conclusions could be expressed a bit more modestly, and limitations should be discussed.

Future work on page 34 could be extended by saying exactly how the ontology could be extended, as well as reused. Now the section is a bit vague.

Review #3
Anonymous submitted on 15/Sep/2022
Suggestion:
Major Revision
Review Comment:

This paper discusses a use case in the tourism sector whose data heterogeneity is tackled with knowledge graphs. The paper proposes an ontology (the Tourism Analytics Ontology, TAO) for the tourism sector and a methodology to apply this ontology to raw data and generate a knowledge graph (the Tourism Knowledge Graph, TKG).

The paper is well written and well explained overall, perhaps even over-explained, as it often repeats itself, which makes it tiring to follow. In principle, the paper could easily be half its length if it didn't repeat so much. I will list here a few points where I noticed repetition, but I would recommend that the authors go through the paper carefully and identify all the places where information is repeated.
For instance,
- the end of the related work section repeats what was mentioned in the introduction
- the data source description repeats what was mentioned in the use case description
- the subsection on ontology reuse repeats what was mentioned in the related work section
- the ontology section repeats what is discussed in the requirements section
- the taxonomy descriptions are repeated in several places
- the data transformation appears twice in one section: first as a summary and then as a slightly more extended version.

**Abstract/Introduction**
In both the abstract and the introduction some claims are quite strong:
- Why is the ontology novel? Later on we see that a large part of the ontology is based on reusing existing vocabularies; that would not characterize it as novel. It may be new, but not novel.
- It is claimed that the approach is modular, but it is not clear why it is more modular than other approaches. Why can other solutions not be extended while this one can?
- It is not clear why the "characterization" of the domain is more comprehensive. Was it compared to others?
- Why would this solution be more easily reused than others? How is this proven?
I would suggest that the authors avoid such claims if they cannot be substantiated.

**Related Work**
The contributions list comprises an ontology, a methodology for applying the ontology to the data, an implementation of the methodology, and a knowledge graph. Considering these four contributions, the related work section should have covered each aspect, but this is not the case. Existing ontologies are indeed discussed, but methodologies for defining ontologies are not. An ontology is developed and a certain methodology is followed, yet we do not know how the methodology used compares to existing methodologies and how this impacts the ontology's design.
Then a methodology for generating the knowledge graph is discussed, but there is no discussion of other methodologies. For instance, how does the proposed methodology compare to the pay-as-you-go methodology (https://doi.org/10.1007/978-3-030-30796-7_32)? Moreover, this paper might be a good source of information for knowledge graph development: https://doi.org/10.1145/3522586
In this section, it is also mentioned that a fine-grained description of tourism accommodation is not available, but it is not proven that the proposed one is more fine-grained than the existing ones.

**Methodology**
As I already mentioned, a certain methodology is followed, but it is not specified how it differs from existing methodologies.

It is mentioned that the results of the data analysis revealed a set of issues with the data, but these types of issues are common to most types of data.

The list of requirements seems to be arbitrary. What did the authors rely on to define them?

In the paper it is mentioned that the taxonomy is formed with rdfs:subClassOf, but why wasn't SKOS used? The sketch below illustrates the two options.
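To make the distinction explicit: the same taxonomy link can be modelled as a class hierarchy or as a SKOS concept scheme, with different semantic consequences. A small illustrative sketch in Python with rdflib, using made-up terms:

```python
from rdflib import Graph, Namespace, OWL, RDF, RDFS
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/")
g = Graph()

# Option 1: class hierarchy -- every instance of ex:BoutiqueHotel is
# entailed to also be an ex:Hotel (full subclass semantics).
g.add((EX.Hotel, RDF.type, OWL.Class))
g.add((EX.BoutiqueHotel, RDF.type, OWL.Class))
g.add((EX.BoutiqueHotel, RDFS.subClassOf, EX.Hotel))

# Option 2: SKOS concept scheme -- skos:broader carries no instance-level
# entailment, which is often all a tag-like taxonomy needs.
g.add((EX.hotel, RDF.type, SKOS.Concept))
g.add((EX.boutiqueHotel, RDF.type, SKOS.Concept))
g.add((EX.boutiqueHotel, SKOS.broader, EX.hotel))
```

A short justification of why subclass semantics is needed for these taxonomies, rather than the weaker SKOS links, would strengthen the paper.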

The extended list with all the information about the lodging facilities, tourist locations and destinations, reviews, etc. could be moved to the appendix. In the paper it is more interesting to discuss the why rather than the what.
Similarly, the detailed lists of functional requirements, the non-functional requirements, and the competency questions could go to the appendix as well.
Figure 3 could also be part of the appendix, as could the ontology snippet in Turtle.

The discussion of the ontologies and what each one covers appears in three places: the related work, the comparison, and the description of which ontology terms are reused. I would suggest bringing the related work and the comparison together to avoid the extensive repetition.
It would help the paper if there were tables (or something similar) that clearly show what the existing ontologies have in common, where they differ, and what the proposed ontology reuses, extends, and introduces.

While the description of the ontology is interesting, especially where it discusses how the choices were made, the paper becomes too detailed on certain occasions, giving the feeling of ontology documentation rather than a scientific contribution.

I might be wrong, but I think the link to the ontology is not provided.

The data transformation is also described in great detail. It would be preferable for the paper to focus on the innovative aspects and avoid the excessive details, which only give the impression of documentation.

It would also be interesting to position the KG development methodology relative to existing ones.

Figure 7 could be optimised in terms of space.

Section 3 is extremely lengthy. I would advise splitting it into more sections, e.g. one for the ontology and one for the knowledge graph, so that it is not so long and there is not so much subsectioning.

In the knowledge graph generation part, a custom provenance solution is adopted, but I am wondering why the provenance solution of the RMLMapper was not used.
It would be interesting to compare the two solutions methodologically, but also with respect to performance.

**Evaluation**
How are the test cases defined? Were the competency questions considered? Why wasn't SHACL used? A sketch of what I mean follows below.
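For example, a CQ-derived constraint could be expressed as a SHACL shape and checked automatically. A minimal sketch in Python with pySHACL; the shape and all tao: terms are invented for illustration:

```python
from rdflib import Graph
from pyshacl import validate

# Hypothetical shape derived from a competency question: every lodging
# facility in the graph must have at least one rating.
shapes_ttl = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix tao: <http://example.org/tao#> .

tao:LodgingFacilityShape
    a sh:NodeShape ;
    sh:targetClass tao:LodgingFacility ;
    sh:property [
        sh:path tao:hasRating ;
        sh:minCount 1 ;
    ] .
"""

shapes = Graph().parse(data=shapes_ttl, format="turtle")
data = Graph()  # in practice, loaded from the TKG

conforms, _, report_text = validate(data, shacl_graph=shapes)
print(conforms)
print(report_text)
```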

How were the structural dimensions chosen? Why are they good metrics for evaluating the ontology?
Similarly, the transparency indicators seem arbitrary; how were they chosen?