Data-driven Methodology for Knowledge Graph Generation within the Tourism Domain

Tracking #: 3314-4528

Authors: 
Alessandro Chessa
Gianni Fenu
Enrico Motta
Francesco Osborne
Diego Reforgiato Recupero
Angelo Salatino
Luca Secchi

Responsible editor: 
Katja Hose

Submission type: 
Full Paper
Abstract: 
The tourism and hospitality sectors have become increasingly important in the last few years and the companies operating in this field are constantly challenged with providing new innovative services. At the same time, (big) data has become the “new oil” of this century and Knowledge Graphs are emerging as the most natural way to collect, refine, and structure this heterogeneous information. In this paper, we present a methodology for semi-automatic generating a Tourism Knowledge Graph (TKG), which can be used for supporting a variety of intelligent services in this space, and a new ontology for modelling this domain, the Tourism Analytics Ontology (TAO). Our approach processes and integrates data from Booking.com, Airbnb, DBpedia, and GeoNames. Due to its modular structure, it can be easily extended to include new data sources or to apply new enrichment and refinement functions. We report a comprehensive evaluation of the functional, logical, and structural dimensions of TKG and TAO.

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Umutcan Serles submitted on 17/Mar/2023
Suggestion:
Minor Revision
Review Comment:

The paper presents the semi-automated development of a Tourism Ontology for a Tourism Knowledge Graph.

(1) originality

Neither the methodology nor the major results are particularly novel. This is also evident from the related work section, where the arguments for novelty are not made very prominently. The proposed methodology is not much different from many other methodologies in the literature. Although it appears to be bottom-up, the moment competency questions are involved there is an implicit involvement of concepts and relationships, which are in turn implicitly used to pick the initial set of data sources for data-driven ontology development. The developed ontology reuses many others, which is actually a good thing.

(2) significance of the results, and

The paper produces many results, and I appreciate the level of detail provided about them. In fact, it almost feels like two different papers: one about the development of an ontology, the other about the generation of a knowledge graph. As a result, the authors list a total of 6 (!) contributions. I think it would be more beneficial for readers if they highlighted the one or two most important contributions, which appear to be primarily the ontology and secondarily the knowledge graph. From the narrative of the paper, I see the methodology and the software that produces the knowledge graph as rather side products. Judging from the source code provided in the paper, the software is quite specific to the generation of this particular knowledge graph, so the most reusable results remain the ontology and the knowledge graph.

The evaluation mostly focuses on the ontology and is quite detailed. It is also based on a framework called OntoMetrics, which provides a nice reference point. Although these quantitative metrics are important, I would still take them with a grain of salt because, for example, having more classes is not necessarily a good thing and may even have side effects on the knowledge graph side. For example, the authors model all amenities as classes and put them in large hierarchies. As far as I can tell, none of the amenity types bring new properties. On the knowledge graph side, this introduces a lot of unnecessary blank nodes: for each hotel room with a WiFi amenity, we need to create an anonymous WiFi instance. This harms the conciseness of the knowledge graph. In my opinion, it would have been a better approach to model such amenities as individuals. Similarly, having many axioms is fine, but without concrete application examples, rather than a large list of generic ones, it is hard to say whether such a large number of axioms is necessary. Every new hierarchy and axiom makes the ontology slightly harder to reuse due to the ontological commitments it imposes.
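To make the conciseness point concrete, here is a minimal sketch of the two modelling options; the class and property names are invented for illustration and are not taken from TAO:

```turtle
@prefix ex:   <http://example.org/tao#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Option A: amenities as classes. Every room needs a fresh
# (typically anonymous) instance of the amenity class.
ex:WiFi rdfs:subClassOf ex:Amenity .
ex:room42 ex:hasAmenity [ rdf:type ex:WiFi ] .   # one blank node per room

# Option B: amenities as individuals. A single shared individual,
# no blank nodes, at the cost of giving up subclass axioms.
ex:wifi rdf:type ex:Amenity .
ex:room42 ex:hasAmenity ex:wifi .
```

With thousands of rooms, Option A materialises one blank node per room/amenity pair, whereas Option B reuses a single individual across the whole graph.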

For the knowledge graph, only some quantitative statistics and a functional evaluation were provided. In general, I believe the knowledge graph is not evaluated in as much detail as the ontology (e.g., regarding conciseness), which also suggests that the paper is mostly about the ontology.

Nevertheless, the results are still significant, though some design decisions could be explained a bit better.

(3) quality of writing.

The writing is overall fine, but another round of proofreading would do no harm. E.g.,

p1 line 47 WikiData -> Wikidata
P2 line 36 “a” knowledge graph
P2 line 42 relative to -> about
P10 line 37 schema:Organisation -> schema:Organization

There is also a structuring issue. Section 3 is called "Methodology for Ontology Design" but it introduces the entire process, although the section covers only the first half of it. Also, Section 3.2 has only one subsection, which is rather strange.

I did not understand the statement about GoodRelations asserting schema:Organization and schema:Place as disjoint; I did not find this in the GoodRelations ontology v1 (http://www.heppnetz.de/ontologies/goodrelations/v1.html), even after reasoning. I can hardly imagine GoodRelations making such a strong statement about schema.org.
Finally, SKOS can be used for alignment with schema.org, which I believe would be a good incentive for people to use TAO for web annotation (they can use the corresponding schema.org terms for better SEO visibility).
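Such an alignment could be expressed with SKOS mapping properties along the following lines; the TAO terms shown here are illustrative guesses, not terms verified against the ontology:

```turtle
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <https://schema.org/> .
@prefix tao:    <http://example.org/tao#> .

# Hypothetical mapping of TAO terms to schema.org vocabulary.
tao:LodgingFacility skos:closeMatch schema:LodgingBusiness .
tao:Accommodation   skos:closeMatch schema:Accommodation .
```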

The external data and material appear to be well organized and are mostly hosted on GitHub.

Overall, the paper is not particularly novel and is currently a bit “overwhelming” in terms of structure and results; however, the results seem to be significant and would contribute to future research.

Review #2
By Eva Blomqvist submitted on 20/Mar/2023
Suggestion:
Minor Revision
Review Comment:

This is a revised version of an earlier submission, presenting a methodology and some resources related to tourism KGs, and in particular the assessment of tourism-related entities such as places and accommodations. The paper has been considerably improved since the last version and is now much clearer, and the contribution is clear from what is described in the paper. However, the paper still needs some work in order to be publishable; see the suggestions and questions below:

There needs to be a section/paragraph on delimitations early on. As it stands, it is only when we get to the KG building itself, and the sources, that we learn that including restaurants, museums, etc. is part of future work.

The related work section still needs some improvement in terms of analysing the related work rather than only listing it. I do agree that this work has novelty, but that novelty is not clearly defined in relation to the related work. In particular, I find Section 2.1 quite confusing, and it does not specify a clear gap that the proposed work fills. First the authors say that the categories of methods are “collaborative”, “non-collaborative” and “custom”. Then they go ahead and describe the “custom” category as if it were all about collaboration, e.g. “involvement of communities of practice” and “most of the time collaborative”. So what is the collaborative method category then? However, it is the last paragraph of that section that needs the most extension, since it is not clear whether the authors actually use an existing methodology or develop their own, and in the latter case, why, and how it is based on existing work. Section 2.2 also needs rewriting, since it reads as a list of inaccessible or non-maintained resources without analysing them any further and without a discussion of how this work builds on them. It is clear that the exact same resource does not exist, but the resource is not the main contribution anyway, as stated in the introduction, so the authors should instead compare methods, scope, etc., to show how these resources were still used, or why the authors could not simply resurrect or reimplement one of the older ones.

Additionally, Section 3 also needs rewriting. The heading is “Methodology for Ontology Design” but the section is not really about that; it has a much broader scope. This is the whole knowledge engineering methodology, i.e. one of the contributions of the paper. Only one small part of this methodology, step 3, is actually about the ontology.

I think the ontology sections are the ones that improved the most since the last version; it is now much clearer how the ontology is structured, what it actually contains, and how things were defined/used/reused etc. However, some small unclear points remain, such as: how do you know that the ontology can be easily extended, as you claim in point 8 on page 11? There are also still some small unclear points in Figure 2: I am pretty sure you don’t mean that owl:equivalentClass has domain tao:LocationAmenity and range acco:AccommodationFeature? Some arrows being used several times with the same label also seem to indicate that the domain is not a single named class but an expression? tao:partOf also seems to be a very generic name for something that only holds between accommodations and lodging facilities. Figure 3 is also not entirely clear. What is the TAD base ontology? And how does it relate to the TA ontology, where I assume the latter is TAO? But then why is TAO the label of the man at the desk at the bottom left? That seems to be some user rather than an ontology. Further, I have a bit of an issue with the formulation of the competency questions, which seem to be somewhat random in terms of their level of abstraction. I know the authors mention that they sometimes use very specific things, like wi-fi, to get questions directly translatable to SPARQL. However, I don’t really agree with this motivation. A more general question would also be directly translatable to SPARQL, since I am sure wi-fi has superclasses or a type. I would suggest keeping the CQs abstract and “instance-free”, and then, if it is useful for the testing, saying that you create one or more detailed questions out of them, which will render a SPARQL query that should be usable without additional inferences over the ontology (e.g. without materialising all the types of some instance based on the is-a hierarchy).
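To illustrate the point about abstraction: an abstract CQ such as “Which accommodations offer a given type of amenity?” still translates directly to SPARQL, with the concrete amenity supplied only when instantiating the test. The property and class names below are hypothetical placeholders, not TAO terms:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/tao#>

# Abstract CQ: which accommodations offer amenities of a given type?
SELECT ?accommodation WHERE {
  ?accommodation ex:hasAmenity ?amenity .
  ?amenity rdf:type ?amenityType .
  # For a concrete test instantiation, bind ?amenityType to e.g. ex:WiFi.
}
```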

Finally, I would suggest including a discussion section, e.g. as a separate section before the conclusions, with some outlook on the possible implications of the work, how it can be used, and what the limitations and lessons learned are. With the list of contributions stated in the introduction, really discussing all of them here would be very valuable to the reader. Similarly, the conclusions section should refer back to the introduction and the list of contributions (which are currently not even all mentioned in Section 6).

Overall, I think all of these things can be fixed, and none of them risks invalidating the conclusions or results of the paper; hence I do not think that another round of reviews is needed, but merely a quick check of the final version before publishing.

Detailed questions/comments:

The introduction section nicely lists the contributions, but would also benefit from some clear research questions for the work that can be revisited in the conclusions section.

What is the difference between contributions 2 and 4 in the list on page 2? Are these different pieces of software? Methods?

What does “formal” mean in the first paragraph of section 2.1?

I find UC3 quite unclear. The other use cases are specific tasks, such as identifying topics in text (UC1 and UC2), doing sentiment analysis to identify sentiments towards something (UC4), or classifying destinations (UC5), but UC3 just says “support the recognition and linking of tourism entities in the KG for different applications revolving in the domain of social media …”. What are the applications of the KG there, then?

In footnote 13: are you sure that a meeting room should count as accommodation according to your definition? If the definition is that broad, then also cars can accommodate people (a person can be inside a car).

Step 4 on page 14: is that entirely automatic? That seems very challenging. Or is there a user involved in validating the mappings?

There is a difference between the PROV model as such and PROV-O (the ontology), and it is not so clear which is meant in Section 4.3.1. For instance, at the bottom of page 20 the authors say “In PROV we have three main classes:”, which should probably be “In PROV-O we have three main classes:”. Also, in this section it is not so clear why all this provenance data is collected. What are the use cases that need it? What are the requirements? And how is this data used in the end?
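For reference, the three main PROV-O classes and their core relations look roughly as follows; the resource names in this sketch are invented examples, not identifiers taken from TKG:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/tkg#> .

# prov:Entity, prov:Activity and prov:Agent with their core relations.
ex:tkgDump      a prov:Entity ;
                prov:wasGeneratedBy  ex:ingestionRun ;
                prov:wasAttributedTo ex:pipeline .
ex:ingestionRun a prov:Activity ;
                prov:wasAssociatedWith ex:pipeline .
ex:pipeline     a prov:Agent .
```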

In Section 5.2.2 the authors suddenly mention a SHACL file with further constraints. If this is not actually part of the KG, then why is it created and tested here? If it is in fact part of the KG definition and development, then it should be mentioned and described much earlier.

I am not really happy with the references in Section 5.3. Ref [8] is cited as a source of metric definitions, while it seems to rather apply and use existing metrics; [25], on the other hand, is an appropriate reference there. OntoMetrics, in turn, is just mentioned with links but never cited; as far as I am aware, there is at least a workshop paper about it to cite. Still, some parts of Section 5.3 are really good: those that really analyse what the resulting numbers mean rather than just listing them. Nice!

Layout, language and structural issues of the paper:
- In the abstract: “semi-automatic generating” -> “semi-automatically generating”
- Introduction, first sentence: shifting -> shift
- Page 2, first contribution: automatically -> automatic, and graph -> graphs or a graph
- Figure captions are sometimes placed above the figure and sometimes below. The normal placement would be to always put them below the figure (while the opposite is often used for tables).
- The division of the text into paragraphs needs to be revised throughout the paper. Many paragraphs are just 1-2 sentences long and actually belong together with other paragraphs; normally, a paragraph should be at least 3-4 sentences long.
- The structure of subsections is not good. In several cases a section has only one subsection; this should not happen. Either you have several subsections or you skip them altogether. Examples of this are 3.2, which only has a 3.2.1, and 4.2, which only has a 4.2.1, and so on.
- I may be getting old, with bad vision, but Figure 4 is really too small for me to comfortably read when I print the paper.
- Section 4.1, first paragraph: direct -> directed?
- Once you get further into the paper, you start to be confused by the numbering of different things. There are too many bullet lists and figures with numbering. For instance, how do the numbers in Figure 1 relate to the sections in chapters 3 and 4, and what about the numbered list at the top of page 17: are these sub-steps of one of the phases in Figure 1? I would suggest thinking about some consistent numbering, e.g. the main phases in Fig. 1 named 1-6, with all subtasks/steps numbered 1.1, 1.2 and so on. Otherwise it is hard to follow where in Figure 1 we are at any given time.
- Sometimes you write TAO and sometimes TA Ontology (e.g. in Fig. 5). Probably this is the same thing, but the reader starts to doubt it after a while.
- Section 4.3, bullet 1 of the first bullet list: “at the and of”?
- Convert the query and the table on page 23 into a proper (numbered) listing and table, and refer to their numbers in the text. Also, for the text right beneath the table: use a footnote for the URL instead of having it in the text directly.
- On pages 26-27 the Accommodation Ontology is suddenly called Acco.
- Appendices should be placed after the bibliography.
- The reference list needs more work; many of the references are incorrect. Just reading the first 10 references, I find at least 5 that list the publisher or the book series as if it were the book title and totally lack the book title (see 1, 3, 4, 7, and 9), and one that lacks the publisher instead (10). I did not go through the whole list, but this should be corrected throughout.

Review #3
Anonymous submitted on 20/May/2023
Suggestion:
Minor Revision
Review Comment:

I thank the authors for their updated version of the paper.

The related work section has improved; however, I think that, despite mentioning the state of the art, the paper still does not position the proposed work with respect to existing works, while the evaluation only assesses whether the ontology fulfils its purpose by considering arbitrary measures. Not all of my comments from the previous review round were considered. While the ontology seems to answer the competency questions, this only validates that it fulfils its purposes, not that it is original and novel. Neither the ontology nor the pipeline and methodology is yet compared with the state of the art, which makes it difficult to judge the significance of the results of this article.

The authors followed my advice only on moving some aspects of the paper to the appendix, but my comments regarding the paper being too verbose and often repeating the same concepts were ignored, and the paper remains lengthy in my opinion. Thus, this comment stands for this version as well.

In detail, the paper claims, among others, the following contributions:

- a **“general” data-driven methodology** for semi-automatic generation of knowledge graphs.
There are many methodologies proposed, and even more in use; why is this methodology general as opposed to the others? Why are the others not general?
If the methodology is one of the contributions, then how does it compare to other methodologies? Which are its novel aspects? As far as the steps are concerned, this seems to be the trivial methodology that most use cases follow nowadays.

- a **pipeline for generating a (tourism) knowledge graph**. What is novel about this pipeline, as opposed to similar methodologies used to generate knowledge graphs? The related work section now briefly mentions other similar projects and their status (e.g., discontinued), but it does not compare the proposed pipeline to the previous ones. Which pipelines did the other projects use? How does this pipeline differ?

- the **Tourism Analytics Ontology (TAO).** As with the pipeline, the related work currently only lists past ontologies but does not compare them with the proposed one. The paper is still limited to stating that the existing ontologies did not cover all concepts. Do they cover different concepts and properties? What are their similarities and differences? Why is a new ontology needed? The need for a thorough comparison was raised in the previous review round but was still not incorporated into the paper.

The evaluation assesses the following:

- functional dimensions, based on competency questions that assess the ontology. However, this only proves that the ontology fulfils its purpose, not its originality compared to the state of the art.

- logical dimensions, based on inference and error provocation, that assess the knowledge graph. However, inference is meant to derive new knowledge, not to validate an ontology, while the error provocation is limited to 8 test cases that are not designed in a systematic way; thus we cannot be sure that they cover all possible errors (could that ever be possible?).

- structural dimensions, to assess the ontology and the knowledge graph. While these metrics have been used in the literature, it is not clear what is proven here. The proposed TAO ontology is only compared with Acco (an ontology which TAO reuses) and one other ontology, Hontology, but not with all the other ontologies used in the tourism sector, for which a comparison is needed to show how TAO is positioned with respect to the state of the art.

This manuscript was submitted as a 'full paper' and should be reviewed along the usual dimensions for research contributions, which include (1) originality, (2) significance of the results, and (3) quality of writing. As is clear from the summary above, neither the methodology nor the pipeline is evaluated even though both are mentioned among the contributions, while the ontology’s evaluation and positioning with respect to the state of the art is still vague. Without addressing these concerns about how the proposed work is novel and original compared to existing works, I cannot see how the article can be accepted, especially as a full paper.