A systematic mapping study on combining conceptual modeling with semantic web

Tracking #: 3329-4543

Authors: 
Cordula Eggerth
Syed Juned Ali
Dominik Bork

Responsible editor: 
Dagmar Gromann

Submission type: 
Survey Article
Abstract: 
Conceptual models aim to represent real systems at a higher abstraction level. The Semantic Web intends to add meaning to any kind of data format to arrive at linked data. Taken together, both help facilitate data processing and integration for humans and machines. In this article, we systematically analyze the research landscape at the intersection of conceptual modeling and the Semantic Web by means of a systematic mapping study (SMS). Following the SMS research methodology, first, the research scope is defined, then the search queries are designed and executed. From initially 5,107 publications, we systematically filtered out all irrelevant ones, leaving us with 484 eventually relevant ones. These publications are analyzed and mapped to a number of taxonomies. The extracted and refined data is analyzed in several analysis steps comprising bibliographical, content, combined taxonomy, and research community analyses. In this paper, we show the most active institutions and authors in the field and the topics on which they work. We moreover show by which means (e.g., modeling language and Semantic Web standard) research predominantly combines the two areas. Besides highlighting flourishing research areas, we also point to many remaining areas for future research in which we scarcely found existing works and/or see great potential.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/Mar/2023
Suggestion:
Major Revision
Review Comment:

Overall evaluation:

This paper reports on a systematic mapping of the literature on the intersection of conceptual modeling and Semantic Web. It explores an interesting gap in the literature. It proposes interesting research questions.

While I would like to eventually see this paper published, there are a number of important issues:

Section 2 (Related work): there are some problems with reproducibility of the search for related work
Section 3 (Research questions and research methodologies): search strings and taxonomies require better justification
Section 4 (Findings): mostly presentation issues, but some analysis claims require better justification

The paper must be accompanied by supplementary material, with tables with all papers identified, their classifications, etc. It must be possible for a reader to see which papers have been classified under which category.

See detailed comments below.

Section 2

While I believe the literature gap that the authors identify in Section 2 really exists, there are problems in the reproducibility of the search in that section.

I found it unusual that the query returned only 55 results, given the various disjunctions.
I tested the more restrictive query below on TITLE only and found 115 results:
(survey) AND (semantic web) OR (survey AND conceptual model)
Adding quotes still leads to 96 results:
(survey) AND ("semantic web") OR (survey AND "conceptual model")
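
For reproducibility, it may also help to state the query in explicit field-code form. One possible title-scoped formulation is the following (a sketch only; I am assuming Scopus Advanced Search, where the TITLE() field code restricts matching to titles, and the grouping shown reflects the reading I intended):

TITLE((survey AND "semantic web") OR (survey AND "conceptual model"))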

Please double check and also specify in such a way as to ensure reproducibility.

In the first search query we are presented with, it is odd that “systematic literature review” (or “SLR”) is not included in the disjunction.

Section 3.1

Concerning the research questions, which are quite interesting…:

Concerning RQ6, which currently reads:
“RQ6: What value is generated by the combination of conceptual modeling and Semantic Web?”

The way it is formulated now does not seem precise enough. Can it be answered as such within the scope of the SLR? Contrast this with: What value is *attributed to* the combination of CM and SW in the literature? By the way, this “value” analysis is the one that requires more justification, including how you later classified papers in each category.

Concerning RQ7, which currently reads:
“RQ7: Which insights can be derived from a cross-taxonomy analysis?”

It is not clear upfront what kinds of taxonomies you are talking about and what a cross-taxonomy analysis in this case would mean.

Section 3.2

The introduction of the taxonomies mixed with the phase “Keyword abstracts” is very confusing in terms of structure. Please introduce the various taxonomies first (an overview), and then detail them, justifying/explaining the choices of categorization. Some taxonomies appear out of the blue, e.g.: “Value Added of Combining SW and CM Taxonomy”. They certainly need to be justified, especially this one about value. This is one of the key issues with the paper in its present state. It hampers readability, but it also poses a methodological problem (since the quality of the taxonomies is key to the SMS).

Please justify the design of the search query in Fig. 2.

There are typos in the search string: “ontolog” “semanticweb”. If this is indeed the string that was used in search, this is worrisome.

Please justify the exclusion criteria (especially why publications from before 2005 are excluded).

You don’t need exclusion criteria that negate inclusion criteria.

After Fig. 3, we are told some details about the export from Scopus, but there are many papers from ACM, WoS, and IEEE. So, how were these papers processed?

I would leave out 2022 altogether from the graphs, because this is partial information and affords perverse inferences (even with the warning in text).

When analyzing Fig. 5, I would raise the possibility that the number of conference publications was affected by the COVID-19 pandemic. This is quite striking in the diagram, and because of the pandemic it is hard to trust the explanation provided (that the area is maturing). This explanation is repeated a number of times.

Section 5:

When searching the Web Knowledge base I found only one paper by Tobias Walter:
https://me.big.tuwien.ac.at/cmsw/analysis?analysis-type=author-by-public...
But a node for Tobias Walter is given great weight in Fig. 18. What explains this? What does “document output weighting” mean?

What explains the focus given to three particular clusters in Fig. 19? Why are these selected?

The time component is sometimes mentioned in the text but is not always present in the corresponding figures.

I got errors when accessing the site and making a simple search:
AttributeError at /cmsw/search
'NoneType' object has no attribute 'startswith'

Editorial remarks:

What is “the real perspective” in section 1?

Please revise “It has intended” and “from a web of data to a web of documents”. Page 2, line 9.

RDF(S) appears and then RDFS (section 1).

Typo: “non” (section 2).

Please add proper Section headings for each phase in Section 3.

Heading “4. Keyword abstracts” is not a good name for a phase.

In the taxonomies, there is quite a mixed style when presenting the various standards. Some sentences start with “RDF: A standard model…”. Others repeat the name of the standard: “RDFa: RDFa means…” Others use “it”, others simply start with a verb: “POWER: Offers…”.

“DSL” sometimes is confused with a language in the text. For example, in page 17, lines 36-37.

Extra space in lines 24-25 of page 24.

There are problems with quotation marks throughout the paper.

Please revise throughout for consistent capitalization of “Semantic Web”.

Some food for thought: should “foundational” papers better be qualified as “application-independent” or “domain-independent” or “general-purpose”?

The quality of Figure 14 is too low.

The conclusion talks about “The single taxonomy”. I did not understand the reference.

Typo in Section 8: “Knowledg”


Fig. 3 talks about “search query 3”, but only one search query is presented in the text (the final one).

Review #2
By Andon Tchechmedjiev submitted on 05/Apr/2023
Suggestion:
Major Revision
Review Comment:

This paper presents a systematic mapping survey at the intersection of semantic web and conceptual modelling. The authors cover related work on systematic reviews on the semantic web and conceptual modelling, followed by a presentation of the selection and analysis methodology, and by an analysis, first of individual taxonomy items for each field, then of their intersection for a cross-analysis of both fields. The paper offers an analytical answer to the research questions formulated at the beginning of the survey. A tool is also developed to explore the extracted database according to the various taxonomic criteria.

While the overall methodology is sound and the analysis interesting, the major shortcoming of this paper is the way the work is introduced and presented.

In my opinion the introduction completely mischaracterizes both fields and their intersection (perhaps because it's not cast in terms of the broader area of knowledge engineering). While the related work is presented pertinently, the authors only use Scopus to perform this analysis, thus excluding most semantic web venues (this only concerns the related work methodology).
What's more, again at the beginning of the document, there are a significant number of grammatical errors and presentation issues. Great effort should be invested in streamlining the presentation, clarifying the positioning of the paper, and better introducing the concepts it addresses.

Detailed comments:

P1 33 ) sees the underlying reality from a more abstract perspective, which focuses on the necessary features, while leaving out the unnecessary ones > The text of the survey should be accessible to all audiences and self-contained. Starting this way assumes good knowledge of both fields; I would rather have a broad opening that explains the commonalities between the two at a higher level and then presents the divergent elements. This would be much more accessible.
The current introduction starts abruptly on the subject matter without taking the time to explain.

P1 37 To some extent, also semantic parts are included at this perspective [4] > This sentence is not grammatical (word order, choice of preposition: at > in). There are quite a few such instances; please do another detailed proof-reading pass, making sure to minimize the literal cross-lingual transposition of syntactic patterns. Numerous and recurrent errors appear, particularly at the beginning of the paper.

P1 40 There are, among others, languages with structural elements like “entities, relationship, and constraints“, and languages with strengths in representing behavioral aspects comprising, e.g., “states, transition, and actions“ [4]

Those two things are not of the same nature; I feel this is a false dichotomy. What's more, why are those two types of languages important in this context?

P1 l48 But a formal underpinning does not add rich model semantics yet. This has to be done by “associating semantics to the language elements“, according to [2]. > In proper typography, one is expected to give an indication of the authors when citing in numerical style, when the citation plays a grammatical role in the sentence. This should be "according to Mayr and Thalheim [2]".

P2 L1: To me the semantic web is clearly a particular technological framework built upon one form of conceptual modelling. Actually, the definition given of conceptual modelling is very much compatible with the open-world assumption adopted for the semantic web. Building a formal ontology clearly relies on conceptual modelling methodologies. This definition of the semantic web is, I feel, extremely reductive, and overall the nature of both domains is very badly mischaracterized.

P2 L 17 "Both conceptual modeling and Semantic Web are independent research areas, in which researchers are continuously exploring the field. However, the topics also intersect to some extent." > This sounds extremely generic and brings no information whatsoever.

P2 46: 2. Related Work > Why Scopus? Neither the Semantic Web Journal nor ISWC is indexed in Scopus, yet those are the premier Semantic Web venues... this would exclude a large chunk of potential papers performing an SMS on the semantic web side.

We don't have the full references here, so it is impossible to check whether the venues indeed exclude the principal semantic web venues.

Section 3.1: Why were those RQs chosen? What's the rationale? It's not enough to just list them. Are they standard RQs for the methodology you employ?

Figure 2: the space in "semantic web" is missing; that feels like a potential issue in a Boolean query that matches the keywords character by character.

Section 3.2: Did you try using the Semantic Scholar unified API?

P7 L32: beginning of sentence not grammatical.

Fig 14: Very bad quality; you need a high-DPI figure there.
Fig 15: Improperly cut?

Review #3
Anonymous submitted on 19/Apr/2023
Suggestion:
Major Revision
Review Comment:

# SWJ 2023 - A systematic mapping study on combining conceptual modeling with semantic web

**Title:** A systematic mapping study on combining conceptual modeling with semantic web
**Authors:** Cordula Eggerth, Syed Juned Ali, and Dominik Bork
**Submission Type:** Survey Article
**Venue:** Semantic Web Journal

## Overview

The paper presents a systematic mapping study investigating publications on the intersection between conceptual modeling and semantic web. The paper is clear and well-structured, covering a substantial number of publications selected under good inclusion and exclusion criteria. The classification of each publication considers several relevant dimensions (referred to as taxonomies), and this information is made available through a dedicated portal that presents all collected metadata. That being said, I judge this to be a relevant submission to this venue and I congratulate the authors for their effort.

Still, I see several points that should be addressed for this submission to be accepted. All these points are listed below under [Major Comments](#major-comments), where I call special attention to the reproducibility of publication database queries, the consistency of presented data, the alignment between data and claims, and the availability of data through the portal. Note that some of the issues I list may indeed be oddities in the data, rather than mistakes, but I would like the authors to check these points.

I look forward to the revised version of this submission.

The specific comments below are divided into major and minor depending on whether they influenced the overall evaluation of the submission.

## Major Comments

- I did not understand the use of reference [4].
- Reference [7] seems to be a course that I couldn't find the contents of online. If that's the case, do not use it as a reference. Refer instead to the seminal works you have used in the classroom.
- Please consider this reference in your related work: On the Philosophical Foundations of Conceptual Models; doi:10.3233/FAIA200002.
- The conclusion section is quite poor. There are no meaningful reflections on the work done or signaling to future works.
- I tried reproducing the query of Figure 1 as presented below. It returns over 28 000 hits, almost 7 000 of them in computer science. How did you arrive at 50? Please present queries in text or code format to ease reproducibility.

```
(survey OR systematic mapping study OR sms OR mapping study OR systematic mapping) AND (semantic web OR semantic systems OR knowledge graph OR linked data OR linked open data OR ontology OR rdf) OR (survey OR systematic mapping study OR sms OR mapping study OR systematic mapping) AND (conceptual model OR modeling language OR modelling language)
```
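
If the intended reading is (survey terms AND semantic web terms) OR (survey terms AND conceptual modeling terms), an explicitly grouped and field-scoped variant along the following lines (logically equivalent by distributivity) would at least make the grouping unambiguous. This is only a sketch; I am assuming Scopus Advanced Search syntax with quoted phrases and the TITLE-ABS-KEY field code:

```
TITLE-ABS-KEY(
  ("survey" OR "systematic mapping study" OR "sms" OR "mapping study" OR "systematic mapping")
  AND
  (
    ("semantic web" OR "semantic systems" OR "knowledge graph" OR "linked data" OR "linked open data" OR "ontology" OR "rdf")
    OR
    ("conceptual model" OR "modeling language" OR "modelling language")
  )
)
```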

- Table 2 seems to be completely wrong, with duplicated titles and incorrect author/title matches. Examples:
  - Alloghani and Gacitua as the first authors of "The XML and Semantic Web"
  - Sabou as the author of "Semantic Web Services Testing" instead of "Semantic Web and Human Computation: The Status of an Emerging Field"
- Section 3.1 could benefit from an accompanying text highlighting the reasoning at least for the most relevant research questions. RQ5, for instance, reads in a confusing way.
- I am of the impression that in Section 3.2 you want to describe the phases of your SMS, and then explain how you execute them, or what they mean in the context of your research. The way it is currently written is a rephrasing of things you said earlier. Some ideas here could also have been used to better explain Section 3.1.
- "...considering title and abstract...": according to Figure 2, you also consider keywords.
- I believe that it would be relevant to include the reasoning behind the keywords chosen for the search query. Also, when I tried to reproduce the query of Figure 2 in Scopus, it returned ~1500 publications rather than ~2100. I don't see this as a serious issue because the breadth of the search query is still clear. Nonetheless, I would like these queries to be easier to reproduce, especially outside Scopus. See below how I reproduced the query of Figure 2:

```
TITLE-ABS-KEY(
(
({conceptual modeling} OR {conceptual modelling} OR {metamodel} OR {meta-model} OR {metamodels} OR {meta-models} OR {domain specific language} OR {domain-specific language} OR {modeling formalism} OR {modelling formalism} OR {modelingformalisms} OR {modelling formalisms} OR {modeling tool} OR {modelling tool} OR {modeling tools} OR {modelling tools} OR {modeling language} OR {modelling language} OR {modeling languages} OR {modelling languages} OR {modeling method} OR {modellingmethod} OR {modeling methods} OR {modelling methods} OR {modeldriven} OR {model-driven} OR {mde})
AND
({knowledge graph} OR {knowledge graphs} OR {linked data} OR {linked-data} OR {semanticweb} OR {ontolog} OR {RDF} OR {OWL} OR {SPARQL} OR {SHACL} OR {semantic systems} OR {semantic system} OR {semantic technologies} OR {semantic technology} OR {RDFS} OR {protege} OR {SKOS} OR {simple knowledge organisation system} OR {JSON-LD} OR {rule interchange format} OR {semantic modeling} OR {semantic modelling} OR {linked open data} OR {vocabularies})
)
AND
(
LIMIT-TO (SUBJAREA, "COMP")
)
)
```

- The ACM Digital Library, for instance, encodes the search query in a URL (example below), so you could use a hyperlink to make your queries easily reproducible for those accessing your paper's PDF. I would not ask you to spell out these huge URLs in the text, though.

```
https://dl.acm.org/action/doSearch?fillQuickSearch=false&target=advanced...
```

- There seem to be some important errors on https://me.big.tuwien.ac.at/cmsw
- When searching for authors like "walter" and "buchmann", the interface returns an error:

```
AttributeError at /cmsw/search
'NoneType' object has no attribute 'startswith'
Request Method: POST
Request URL: http://me.big.tuwien.ac.at/cmsw/search
Django Version: 3.2.18
Exception Type: AttributeError
Exception Value:
'NoneType' object has no attribute 'startswith'
Exception Location: /apps/app/searchutils.py, line 146, in
Python Executable: /usr/local/bin/python
Python Version: 3.7.16
Python Path:
['/',
'/usr/local/lib/python37.zip',
'/usr/local/lib/python3.7',
'/usr/local/lib/python3.7/lib-dynload',
'/usr/local/lib/python3.7/site-packages']
Server time: Mon, 17 Apr 2023 14:07:01 +0000
```

- Other authors return a seemingly low number of publications, like "guizzardi" returning 10.
- The summary tables on the homepage, like "Detailed Analysis of Institute By Papers", do not seem to agree with the paper's numbers. For example, the University of Vienna has 21 publications according to the paper, and only 7 according to the website.
- I would like to have access to the working Web Knowledge Base to better understand the taxonomies listed in step 4 of Section 3.2. Most searches for publications I tried running returned errors, even "uml", which is used in the example of Figure 21.
- I believe that some sentences are problematic in terms of causation vs correlation. This excerpt is an example of what I mean: "This confirms that the field was growing but was not maturing until then. Since 2019, the number of conference papers published has come down to a level similar to the number of journal articles, which indicates that the research in the field is starting to mature in recent years (see Fig. 5)."
  - First, I believe that a more rigorous analysis is called for to support conclusions about the maturity of the research in the field.
  - Second, I find it difficult to disassociate the drop in conference papers in 2019 from the global pandemic. The dataset considers neither how conferences were organized in those years nor how many submissions they received overall.
- The information in tables 3 and 4 (they are actually 3 tables) seems strange. If these tables are indeed correct and there is no incomplete information in the spreadsheets, there are huge long tails in the distributions of publications per institution and venue. Is it possible for the authors to share their raw spreadsheets? Check this out:
  - Total publications analyzed: 484
  - Top 10 institutions summed up (including overlaps, I assume): 23+21+14+10+8+8+7+7+6+6=110
  - Minimum overall number of institutions: (484-110)/6=63
  - Top 10 conferences summed up: 17+6+5+5+5+4+4+3+3+3+3+3=61
  - Top 10 journals summed up: 6+5+5+4+4+4+4+3+3+3+3=44
  - Minimum overall number of venues (conferences and journals): (484-61-44)/3=127
  - These numbers mean that there are at least (and likely many more than) 63 other institutions and 127 other venues beyond those listed in the tables.
  - Bear in mind that several papers would have co-authors from different institutions, increasing their number. Also, the same institution being listed under the names of its research groups or under different spellings may be affecting the data.
- Regarding plots like the one in Fig. 6, it is hard to draw any conclusions about the trends in the data. If you don't compare the rate of growth of each line with the overall rate of growth in the field, the former could simply be following the latter.
- Take this excerpt now: "In the mid-2000s, all categories started from a low level, while the number of publications on Linked Data and Vocabularies increased considerably after 2011, the number of publications on Inference and Queries achieved merely a slightly higher level in this time period." I think this is an interesting finding in the SMS deserving of a deeper reflection. Is it the case that queries and inferences are not seen as relevant subjects in the intersection between CM and SW by the community? Or is it the case that authors simply view queries and inferences as a byproduct of expressing their models in the SW world? The information in the taxonomy of Fig. 10 can help in this analysis.
- Again, in Fig. 14 it seems hard to disassociate the change in numbers from the overall growth in the number of publications. Also, some combinations simply have a sample size too small to support any meaningful conclusions.
- Take the following excerpt: "Overtime notably the combination of methods with representation have grown considerably, as well as in general all of the largest combinations mentioned above. However, the combinations of taxonomy elements in the lower left corner exhibited a significant decrease over time." Fig. 15 does not support this analysis over time.
- Please consider carefully how significant are the differences between bubble sizes when drawing your conclusions. In Fig. 16 I cannot see a significant difference between BPMN and OCL, ER, or AML. I am not saying that there isn't one, but the plot doesn't make this clear.
- Is it the case that some combinations are especially interesting, or do they simply involve more popular elements? For instance, in Fig. 17, it seems that any combination with UML will be heavily influenced by the overall number of publications involving UML.
- Could you please check whether the affiliations of the most prominent authors (e.g., D. Gasevic) are listed correctly in Table 3? Something seems strange. I am also surprised that there is no cluster around Guizzardi in Fig. 19 given his involvement with two top institutions of Table 3 and the number of publications involving OntoUML.
- As of April 19, some features of the website are back online (e.g., searching for "uml" works now). The tables remain inconsistent with the search itself, however. For example, there are 2 publications associated with "guizzardi" according to the list of authors, but the search returns 10. Still, please take this restoration of features into account when reading the comments above.
- I really appreciate the authors' effort to provide an interface for users to crawl their data. However, I would invite them to go one step further and embrace the semantic web approach by publishing the metadata as linked data. Doing so while adopting the FAIR principles would be even better.
- Table 3, the University of Valencia is not in China.
- Table 4, it is curious that there are as many publications in the SW/CM intersection at the IEEE Aerospace Conference as at the International Semantic Web Conference. I am not saying that it is wrong, but I wonder if CM authors avoid/cannot publish at ISWC.
- Regarding the arguments about the field's maturity, I wonder how much of it relates to how the community is organized as well, i.e., how prevalent conference papers are. Also, when compared to other fields, conference papers in the CM and SW fields are subject to quite a rigorous peer review that evaluates the submission in its entirety, rather than just an abstract, for example.

## Minor Comments

- From the beginning, you tie conceptual models to information system design but leave the "semantic web" free of that constraint. I invite you to think of both in the same manner, as this idea may justify the unintended presence of design concerns in conceptual models.
- "was was"
- "When Sandkuhl et al. (2018) conceived...": I guess this reference is not compatible with the brackets style. Shouldn't it be "Sandkuhl et al. [13]"? This seems correct later in the conclusion; "... from Petersen [23] and Kitchenham [24]."
- I would have liked to see more about the research goals in the introduction. The intro does a good job on the motivation side, but there is very little on "what we are going to do" or "what we want to achieve" at that point.
- There is a different header in Table 1, "Topic".
- If you want to adopt the abbreviations SW and CM, it would be best to go all-in instead of using them alongside the full terms.
- In the text, spell logical operators using capital letters (e.g., "AND") the same way you spell them in your queries.
- In Figure 2, there are a few entries that could have been misspelled (e.g., missing spaces between words). This is not a major issue, but still a consistency issue.
- "One publication was assigned to only one Semantic Web activity area." Do mean "each publication was assigned"? Also, I would replace the Earlier reference to "W3C activity area" with "Semantic Web activity area" to avoid confusion.
- Please eliminate the vertically aligned names on the x-axis; this is unnecessary and very space-demanding (e.g., Fig. 10).
- Consider re-ordering the taxonomies presented in Section 3 according to the order they are used in Section 4.
- Please increase the resolution of all figures but with special attention to Fig. 14 which is not readable even on a computer.
- Please avoid cropped plots. Fig. 16 is a major example of this.
- I noticed a few grammatical mistakes in the paper, but these can be corrected with the support of automated tooling.