A Knowledge Graph for Semantic-Driven Healthiness Evaluation of Online Recipes

Tracking #: 3260-4474

Charalampos Chelmis
Bedirhan Gergin

Responsible editor: 
Mehwish Alam

Submission type: 
Dataset Description
The proliferation of recipes on the Web presents an opportunity for developing AI methods to promote healthy nutrition of people using the Internet as a source of food inspiration. Recent research endeavors have resulted in the development of ontologies related to food, and algorithmic solutions for ingredient substitution. However, there is a lack of a resource oriented towards promoting research in semantic-based algorithmic meal plan recommendation and/or individual ingredient substitution that explicitly incorporates healthiness into the recommendation process. To address this gap, we present a knowledge graph comprising a large collection of recipes sourced from Allrecipes.com, their ingredients and corresponding nutritional information, social interactions metadata, and healthiness information calculated based on two international nutritional standards. We describe the construction process of our knowledge graph, and show its utility in quantitatively evaluating the healthiness of online recipes.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 05/Feb/2023
Major Revision
Review Comment:

The paper presents a knowledge graph aiming to describe a dataset composed of recipes and to associate with each of them a sort of nutritional score based on the ingredients contained within each recipe.
In general, I do not have any major issues concerning the paper.
I think that this is a very timely topic in an era in which a lot of effort is focused on creating AI-based solutions for supporting health and well-being.
The paper is easy to read and the process of constructing the knowledge graph is clear.

I would only recommend the authors perform the following actions in order to enhance the overall quality of the paper.

First, it may be very helpful to have a section dedicated to the construction of the underlying schema, covering the ontology engineering methodology adopted, the choices made in selecting concepts and properties, etc.
The paper would definitely benefit from such content.

Second, the related work section should include not only the existing knowledge graphs concerning recipes but also the ontologies available in the nutritional domain.
For example, the authors mention the HeLiS ontology in the conclusion; I think that HeLiS and other such ontologies should be discussed directly in the related work section.
Moreover, it would be very interesting to see how the underlying schema of the proposed knowledge graph may be aligned with such ontologies (e.g., the HeLiS ontology).

Third, the automatic extraction of information from free text is a challenging task.
The authors mentioned the difficulties they encountered in doing that.
It would be very interesting to see further details about the effectiveness of the extraction strategy they adopted.

Review #2
Anonymous submitted on 21/Apr/2023
Major Revision
Review Comment:

The paper introduces a knowledge graph for evaluating the healthiness of recipes.

The paper is very clearly written and clearly situates the work wrt the SOTA KGs out there.

- For extracting the ingredient information, why do the authors need CRF? Why not simply use RegEx for extracting this information?
- Does the KG contain information about the taste of each of the ingredients? For example, salt and sugar look similar.
- The authors motivate the KG as a resource for ingredient substitution; the case described in the previous point may be an issue there.
- The competency questions are not defined explicitly, and the SPARQL queries at the end of the paper are fairly simple and treat the KG as a database. Why is RDF needed to express this information?
- How do the authors retain the way of preparation (the sequence of steps) for cooking a recipe? For example, in "Biryani", an Indian recipe, if the chicken is replaced with beef, the cook time changes and slight modifications to the process follow.
- Can these SPARQL queries be used for recommendation purposes?
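
As an illustration of the RegEx question in the first point, a regular-expression baseline could look roughly like the sketch below. It is not the authors' pipeline: the unit list, the pattern, and the name parse_ingredient are assumptions made for this example only.

```python
import re
from fractions import Fraction

# Hypothetical unit vocabulary; a real recipe corpus would need a far larger list.
UNITS = [
    "cups", "cup", "tablespoons", "tablespoon", "teaspoons", "teaspoon",
    "ounces", "ounce", "pounds", "pound", "grams", "gram",
]

# Optional quantity (mixed number, fraction, or decimal), optional unit, then the name.
PATTERN = re.compile(
    r"^\s*(?P<qty>\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)?\s*"
    r"(?:(?P<unit>" + "|".join(UNITS) + r")\b)?\s*"
    r"(?P<name>.+?)\s*$",
    re.IGNORECASE,
)


def parse_ingredient(line):
    """Split one ingredient line into quantity, unit, and name (illustrative only)."""
    match = PATTERN.match(line)
    if match is None:
        return None
    qty = match.group("qty")
    # "1 1/2" becomes Fraction(1) + Fraction(1, 2) = 1.5
    quantity = float(sum(Fraction(part) for part in qty.split())) if qty else None
    unit = match.group("unit")
    return {
        "quantity": quantity,
        "unit": unit.lower() if unit else None,
        "name": match.group("name"),
    }


if __name__ == "__main__":
    for text in ["1 1/2 cups all-purpose flour", "1 (14.5 ounce) can diced tomatoes"]:
        print(parse_ingredient(text))
```

A pattern of this kind handles regular lines such as "1 1/2 cups all-purpose flour", but on a line like "1 (14.5 ounce) can diced tomatoes" it leaves the parenthesised size and the container inside the ingredient name. Cases of that sort are one possible motivation for a learned sequence labeller such as a CRF.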

Review #3
Anonymous submitted on 04/Jul/2023
Review Comment:

The authors scrape schema.org data for recipes from Allrecipes.com, including ingredients, nutritional information and provenance data. The ingredients are normalized and 89 ingredients are linked to the FoodOn ontology. The scraped nutritional information is used to analyze the healthiness of recipes.

* Availability: The described RDF dataset containing the KG is not available as a dump. The persistent identifier only links to a dataset containing a "sample-data.ttl" file with a small subset (100 recipes). The SPARQL endpoint seems to provide access to the complete dataset, but returns different numbers (13,979,375 triples and 77,317 recipes) and fails to return results for all three provided example queries (due to errors in the queries). The GitHub repository provides multiple Turtle files based on the different sitemaps used for extraction as well as an ontology file. The provided ontology defines only 30 classes, 19 datatype properties and 22 object properties. The authors seem to have counted reused properties, and at least the described subclasses of recipeKG:recipesCategory are missing. Overall, the dataset should be consolidated. The persistent identifier, SPARQL endpoint and GitHub repository should all use the same dataset, providing one consistent version that matches the examples given in the paper.
* Licensing: The paper mentions the CC0 license, but the ontology states CC BY-NC 4.0 as dcterms:license and the GitHub repository uses an Apache-2.0 license. The authors might consider explaining these different licenses as part of their paper in order to ensure reusability. However, given that this dataset is generated from scraped recipes, which could be considered intellectual property of allrecipes.com, the authors might not be allowed to publish such a dataset at all. Before publishing such a dataset it might be a good idea to contact allrecipes.com to avoid copyright complaints and ensure long-term availability.
* Knowledge modelling patterns: The KG links to schema.org as an external resource and uses RDF, RDFS and OWL vocabulary. However, the authors do not provide details on the vocabularies used. Additionally, no knowledge modelling patterns are described and no requirements are formulated. Consequently, the structure and design of the KG are difficult to understand. For example, the KG uses many blank nodes but does not explain why they are used.
* Sustainability: The presented dataset is based on data scraped in December 2021. Considering this, keeping the dataset updated with more recent recipes from allrecipes.com seems to be difficult. However, an outdated dataset is not sustainable. Additionally, the authors mention the GitHub repository as a way to encourage collaboration and extend the KG. The last git commit is from June 2022 (over a year ago) and no issues or discussions can be seen in this repository.
* Clarity and completeness: The generation of the KG is clearly explained in detail, and both the step of splitting the ingredient list into ingredient, quantity and measurement unit and the entity linking step are even evaluated. However, the evaluation cannot be reproduced with the given data, and the claimed accuracy of 100% raises a red flag. By comparison, the novel contribution of the dataset (modelling healthiness scores with SWRL rules) is hardly described at all.
* Novelty: The paper makes it particularly difficult to recognize the contribution of the authors. Given that allrecipes.com already uses schema.org to annotate recipes, most of the scraped data seems to be only converted from JSON-LD to Turtle format. Given that the main novelty of the proposed dataset would be the modelled healthiness scores, the dataset does not seem to be particularly novel. Nutritional information as well as nutritional guidelines are already modelled by, for example, FoodKG.
* The introduction discusses many potential use-cases of a recipe knowledge graph containing detailed nutritional information, but the created dataset is only able to provide a healthiness score based on SWRL rules. For example, Table 1 claims that the KG can be extended with additional guidelines. However, it is neither tested nor described in detail how such an extension could be done.
* How is the nutritional information from allrecipes.com generated? Example recipes carry this disclaimer: "Powered by the ESHA Research Database © 2018, ESHA Research, Inc. All Rights Reserved". This implies that the nutritional information from allrecipes.com consists of automatically calculated estimates based on an already existing dataset. Additionally, it again raises the question of whether the authors even have the required rights to publish such a dataset.

Minor comments:
* It is not clear why normalizing measurement units in a KG should be a challenging research problem. Of course, converting a volumetric measurement unit into a weight-based measurement unit might depend on the ingredient, but especially here a KG could provide a great benefit without too much effort.
* Figure 8 should have a uniform style. Why is subfigure d presented in a different style?
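
The ingredient-dependent conversion mentioned in the first minor comment can be sketched as a lookup of per-ingredient densities. This is only an illustration: the density values are rough figures chosen for the example, not data from the paper or the KG, and to_grams is a hypothetical helper.

```python
# Millilitres per US customary volume unit.
ML_PER_UNIT = {"cup": 236.588, "tablespoon": 14.787, "teaspoon": 4.929}

# Rough, illustrative densities in grams per millilitre (assumed values).
DENSITY_G_PER_ML = {
    "water": 1.0,
    "granulated sugar": 0.85,
    "all-purpose flour": 0.53,
}


def to_grams(quantity, unit, ingredient):
    """Convert a quantity to grams, or return None if the conversion is unknown."""
    if unit == "gram":
        return quantity
    if unit not in ML_PER_UNIT or ingredient not in DENSITY_G_PER_ML:
        return None  # no volume factor or no density for this ingredient
    return quantity * ML_PER_UNIT[unit] * DENSITY_G_PER_ML[ingredient]


if __name__ == "__main__":
    print(to_grams(1, "cup", "all-purpose flour"))
    print(to_grams(2, "tablespoon", "olive oil"))  # unknown density
```

In a KG, the density table would naturally become a property attached to each ingredient node, so the conversion itself stays a one-line computation.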

Overall, the paper and dataset lack a clear use-case. Scraping allrecipes.com and analyzing this dataset provides many good starting points for relevant research. However, the proposed dataset does not seem to be kept up to date, nor is it sufficiently linked to existing datasets in this area (mapping only 89 of 6309 ingredients without further explanation does not make sense) or particularly well described. For example, it lacks such important information as how the nutritional information is even generated. Implementing healthiness scores based on SWRL rules is an interesting approach, but the paper does not go beyond defining a few rules and it completely lacks any discussion of this. The second part of the paper could benefit from actually considering research questions from domain experts. The analyzed categories do not seem to be particularly relevant for experts: what would be the impact if "trusted brand recipes" are healthier than "world cuisine" recipes? Unfortunately, the current state of this paper is not mature enough to be published.