Review Comment:
The paper has been submitted as an Ontology description, and indeed has an ontology as its main subject. The euBusinessGraph Ontology (EBG) is an interesting resource, and deserves publicity within the semantic web community as well as in the business info domain community. The model is freely available (on GitHub); I am, however, not sure about the license (at least, the paper does not mention it).
As the main downside, I am struggling with the overall scope of the paper, which also includes topics that are only partially related to the ontology itself and significantly increase the paper’s size. The category of “Descriptions of ontologies” is defined at the SWJ website as “short papers describing ontology modeling and creation efforts”. The paper, however, has 39 pages (which would have been quite a lot even for a full paper!), of which:
- 10 address the motivations, SotA, requirements and the development process
- 14 provide a reference overview of the ontology
- 13 describe use cases and follow-up projects
- 2 contain the biblio.
The first part is what I would truly expect in an onto paper.
The second part already makes the paper a bit longish, but prevents the reader from having to peep into some documentation/tutorial in parallel, so might still be acceptable, too.
However, the third part is, in my opinion, beyond the scope of a paper of this kind. I could imagine that *very short* (say, 2-3 pages in total) descriptions of use cases, with links, might be relevant for an onto paper. A table showing which parts of the ontology have been used in each use case could be nice, too. But not those 13 pages here: most of the text either does not refer to the ontology at all or only mentions it in an uninteresting way. No modeling challenges addressed while adapting legacy datasets to the ontology are mentioned (e.g., Fig. 20 simply shows that some datasets may use properties not used in other datasets – but that is an inherent feature of the graph model of linked data, not an added value of the EBG ontology by any means). Instead, features of the authors’ company/institute platforms/applications (DataGraft, GraphDB, or the euBG Marketplace) or additional models (ONTO-CG) are described. They might perhaps deserve their own (dataset, application, ontology, or tool/system) papers in SWJ, or be part of some overview paper of the euBusinessGraph EU project, but should definitely not overload the current ontology paper. Even the SWJ instructions recommend including only “*pointers* to existing applications or use-case experiments”.
As regards the scientific and practical impact of the paper (only considering its parts which I perceive as relevant):
- There are no major scientific challenges addressed: the ontology structure mainly looks straightforward, its development has mostly been based on the common practice in the community, and the size of its *novel* parts is relatively small. Actually, it is to a large degree an abstract/prototype dataset schema (reusing many ontologies and complementing them with a few entities in a new namespace) rather than a compact ontology.
- OTOH, against the background of the pretty comprehensive SotA review, such a model might still be useful in practice, even beyond the consortium of the euBusinessGraph project.
An interesting aspect that would be worth elaborating is the relationship between the ontology and the EU legal space. Since the ontology has ‘eu’ in its name, it should pay specific attention to the common knowledge assets of the EU. While it may have been the case during its development, it is not properly explained which parts of the model are EU-specific (esp. those linked to particular codelists?) and what possible caveats might appear if it were used for non-EU data.
And, more specifically: the ontology has been built, to some degree, bottom-up, leveraging on datasets provided by four of the organizations of the co-authors (cf. p.6, lines 33-36). The paper should contain some distinct posterior evaluation on external, ‘retained’ company data sources, indicating that this bootstrap has presumably not induced major gaps or distortions.
As regards the ontology statistics: I am rather confused by the fact that the authors list the numbers of classes, OPs and DPs, but do not distinguish how many of them are just reused from existing namespaces (and how many from which), and how many are newly defined. Furthermore, the authors obviously aim to provide the ontology in order to fill a gap in the ‘market’; however, there is no explicit discussion of those specific (smaller) gaps in the domain that had to be filled with those new ‘ebg:’ entities, such as WebResource or IdentifierSystem. Were there no alternatives for any of these, at all? Or not close enough? Or, not authoritative/popular enough to deserve reuse?
I am generally favorable towards large-scale reuse of entities from existing ontologies provided those are well visible and respected, and the realization of this reuse by the authors looks sound. Yet, the chosen approach – direct reuse by replication - is just one of several possible reuse options, alongside, say, the creation of proxy entities in the new namespace, the direct reuse by import (of whole ontologies) or the reuse by reference (w/o a proxy). The authors should attempt a discussion of the pros and cons of their solution. Why and how would, for example, the chosen reuse model influence the adoption of their ontology? How convenient is the multi-namespace approach for the data publisher? See, e.g., the older study by Schaible [1].
[1] Johann Schaible, Thomas Gottron, Ansgar Scherp: Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling. ESWC 2014: 457-472
It is also unclear to what degree the various SKOS concept schemes referenced are a ‘part’ of the ontology or not. Can someone claim to completely use the ontology while choosing their proprietary codelists instead of those referred to in the ontology?
As regards the display of entity namespaces in the paper, the authors silently introduce the following convention: in the diagrams, the prefix is appended, in braces, to the entity short name, followed by the cardinality; in the text, only the short name is used. This saves space and allows the reader to focus on the semantic content; however, it also incurs some paging back and forth when the reader wants to recall the actual namespace (even in two steps, to Table 1, if the prefix is not familiar). OK, but it is again a choice that should be explicitly introduced and justified.
The quality of English in the paper is very good. There are just a few typos here and there. A technical problem related to typography is however the low resolution of the diagrams.
Detailed comments to the content:
- p. 3, 32-35: “we look specifically at works dealing with basic information about companies, covering organizational structures of companies, economical classifications of companies, company identification schemes, and locations of companies”. The notion of ‘basic’ company info, as the scope of the ontology, is not properly explained. What criterion is used to decide what is basic or not? Frequency of use in datasets? Some structural, or deeper ontological criterion? And what is, for example, non-basic, then? Especially that the ontology actually even describes *meta-*data on the company-description datasets and ID systems... which is not information about companies proper.
- p.4, 13-14: “The CBV is published by W3C as a part of public working draft named RegOrg since 2013.” Probably the same RegOrg as mentioned in lines 5-6?
- Section 3.1: The CQs only seem to cover three of the modules; there is no CQ for the Dataset module present.
- p.11, Fig. 3: At first glance it might not be obvious why two apparently related classes, ‘RegisteredOrganization’ and ‘Organization’, are not directly linked – whether by rdfs:subClassOf or by some other link. The authors do not discuss this mystery here, and it only becomes completely clear through the example in Fig. 8 much later. If I understand correctly, an organization in a registry ‘lives in a different world’ than a ‘general’ organization (classified by schema.org) that can be, for example, the maintainer of the registry. This is however a strong modeling commitment, which deserves some discussion.
- p.11, 44 – p.12, 1. As regards the use of OWLGrEd: OK, it is a nice tool, but how much does it actually offer in this particular case compared to plain UML?
- p.12, 9-10: “We used the Terse RDF Triple Language (Turtle) syntax as the file format for the ontology.” This is fine, but irrelevant for the paper. Any RDF serialization is just RDF.
- p.12, 33-37: “The ontology uses domainIncludes{schema} and rangeIncludes{schema}, which are polymorphic and describe which properties are applicable to a class, rather than domain{rdfs} and range{rdfs}, which are monomorphic and prescribe what classes must be applied to each node using a property. We find that this enables more flexible reuse and combination of different ontologies”. This is pretty laconic and uses non-intuitive terms w/o defining them. What is the exact meaning of ‘polymorphic’ here? By what mechanism does it lead to more flexible reuse? Is the EBG ontology sufficiently similar in spirit to schema.org (which has been primarily designed as a mark-up vocabulary for search engines) to justify the copying of this pattern? Which other respected ontologies use the ‘...Includes’ versions of domain/range? Where is it described as a best practice? The choice itself might be sound, but not w/o a more elaborate justification!
- p.15, 1-2: “The operational and/or legal registration status of the entity, e.g., whether a company is active or not. There is no globally accepted list of company states.” Probably, ‘of company statuses’?
- p.15, 26-27 vs. 34-35: It seems that the term ‘geographic coordinates’ is used in two different senses: “Least precise geographic coordinates are resolved at the level of a country” (broader sense, incl. full address) vs. “However, to represent geographic coordinates, Schema.org was used...” (narrower sense: lat + lon).
- p.18, 1 vs. 9-10: “isPartOf: System the identifier is a part of” vs. “The IdentifierSystem class represents a system managed by a publisher (e.g., a register or agency) that is used to issue identifiers to companies.” It looks semantically a bit odd to consider a system to issue identifiers that are at the same time its part. You can view the collection of IDs as a ‘system’, and you can also view the rules for creating those IDs as a ‘system’, but it should not be the *same* system.
- p.18, 14-15: Following up with the previous comment. The properties schema:author and dct:creator are normally used rather interchangeably afaik. Here you use them to make a very specific distinction: one refers to the author/creator of the system of rules, while the other to the author/creator of the IDs *using* this system of rules. In this context I see the stress on the maximal reuse of common properties as counter-productive. You should make clear what the system actually is (a collection of IDs, or the rules for coining them) and probably one new missing property should be introduced, with a lexical semantics clearly distinct from that of ‘creating’ or ‘authoring’. Either that of a subject that *applies* the system (as rules), or that of a subject that *sets up the rules for* creating the system (as collection of IDs).
- p.18, 44-45: “isPersistent: Whether identifiers can be removed from the register (e.g., when a company is dissolved)” To keep the same Boolean polarity, it should be changed to ‘cannot be removed from’, or ‘has to be kept in... (even if the company is dissolved)’. And, similarly, for ‘isImmutable’ in p.19, 1.
- p.19, 4-5: Same bullet repeated twice.
- p.19, 6-7: “isDumb” Why not rather call it ‘isOpaque’? This is much more technical than ‘dumb vs. intelligent’. Even a lexically meaningful ID does not have any intelligence per se, actually, merely some moderately ‘intelligent’ application can make use of it...
- p.19, 11: “isEnumerated: Whether the system has an issuer, and issued identifiers are kept in a database(register).” What does the opposite case look like? An example would help.
- p.19, 21-23: “replacementPattern: Pattern to use together with the validationRegex to normalize identifier values by removing optional decorations.” An example would be nice here, too.
- p.20, Fig. 7: Maybe OK, but just wondering – is “Issues company identifiers within the Atoka company database” the essence of the business activity of SpazioDati, so as to serve as its schema:description?
- p.20, 38-41: “An officer is a natural person (as opposed to a legal person) that has a high-level management role in a company... they typically serve at the will of the company directors, who can fire or replace them.” Does this mean that directors and shareholders are already beyond the ‘basic’ info on a company? This returns me to the general question raised in the very first comment on this list.
- p.21, Fig. 8: The caption is incomplete, it only refers to the OpenCorporates system and not to the official UK system.
- p.22, 46-43: “VOID describes RDF datasets in terms of entities (i.e., number of triples)” No, void:entities counts the entities and void:triples counts the triples.
- p.24, 26-28: “e.g., age and dateOfBirth attributes are connected by the following rule age=year(today) – year(dateOfBirth)” The example is a bit faulty. If someone is born on 31 Dec 2002 and today is 1 Jan 2020, s/he is barely 17, and not 18 as the rule would give.
- p.33-35: The demonstration of the use of the ontology in the marketplace should rather be provided through some instructive diagram showing how data is integrated from different sources while solving a particular query, not as a screenshot of an app. (Provided you wished to keep some concrete use case example in the shortened paper. Most of the content of this section should however be removed, as I noted in the review intro part.)
- p.36: The cg: prefix is not even defined in the paper. The text on an extension of the ontology (though, possibly, interesting) is not an organic part of the paper. ONTO-CG is another ontology, after all.
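To make the question on p.12, 33-37 concrete: the following is a minimal sketch of the difference I assume the authors have in mind, written in plain Python rather than with an RDF library. The `ex:` names and the helper function are hypothetical and serve only as illustration. Under RDFS entailment (rule rdfs2), every rdfs:domain statement forces a type onto the subject, so two domain statements on one property make the subject an instance of both classes; schema:domainIncludes triggers no such inference.

```python
RDF_TYPE = "rdf:type"


def infer_types_from_rdfs_domain(data, rdfs_domain):
    """Apply RDFS rule rdfs2: (p rdfs:domain C), (x p y) => (x rdf:type C)."""
    inferred = set()
    for subject, predicate, _obj in data:
        for cls in rdfs_domain.get(predicate, ()):
            inferred.add((subject, RDF_TYPE, cls))
    return inferred


# One toy data triple; all identifiers are hypothetical.
data = [("ex:acme", "ex:legalName", "ACME Ltd")]

# Two rdfs:domain statements on the same property:
# the subject is inferred to belong to BOTH classes at once.
rdfs_domain = {"ex:legalName": ["ex:Organization", "ex:RegisteredOrganization"]}
with_rdfs = infer_types_from_rdfs_domain(data, rdfs_domain)

# schema:domainIncludes has no RDFS entailment attached,
# so an RDFS reasoner is given no domain axioms and infers nothing.
with_schema = infer_types_from_rdfs_domain(data, {})

print(sorted(with_rdfs))
print(sorted(with_schema))
```

If this is indeed the intended meaning of ‘monomorphic’ vs. ‘polymorphic’, a short example of this kind in the paper would make the argument much easier to follow.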
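As for the example requested for replacementPattern (p.19, 21-23), something along the following lines would suffice. This is only my guess at how the two properties interact: the regular expression, the replacement string, the sample identifier format and the function name are all hypothetical and not taken from the paper or the ontology.

```python
import re
from typing import Optional

# Hypothetical validationRegex: nine digits, optionally grouped by spaces,
# with an optional country prefix and an optional suffix as decorations.
VALIDATION_REGEX = r"^(?:NO)?\s*(\d{3})\s*(\d{3})\s*(\d{3})(?:\s*MVA)?$"

# Hypothetical replacementPattern: keep only the three captured digit groups.
REPLACEMENT_PATTERN = r"\1\2\3"


def normalize_identifier(raw: str) -> Optional[str]:
    """Return the normalized identifier, or None if the value is invalid."""
    match = re.fullmatch(VALIDATION_REGEX, raw.strip())
    if match is None:
        return None
    # Expand the back-references of the replacement against the match.
    return match.expand(REPLACEMENT_PATTERN)


print(normalize_identifier("NO 123 456 789 MVA"))
print(normalize_identifier("123456789"))
print(normalize_identifier("12345"))
```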
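The problem with the age rule quoted from p.24, 26-28 can be checked directly. The sketch below compares the year-difference rule from the paper with an age in completed years; the function names are mine, and the corrected variant is just one common way of computing it.

```python
from datetime import date


def age_by_year_difference(birth: date, today: date) -> int:
    """The rule as quoted from the paper: year(today) - year(dateOfBirth)."""
    return today.year - birth.year


def age_in_completed_years(birth: date, today: date) -> int:
    """Subtract one year if this year's birthday has not yet occurred."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


birth = date(2002, 12, 31)
today = date(2020, 1, 1)

print(age_by_year_difference(birth, today))   # 18
print(age_in_completed_years(birth, today))   # 17
```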
Language issues:
- p.3, 31-32: “Several ontologies and data models were developed in the literature”. Either ‘developed’, or ‘described in the literature’, but not both. Models are not developed by writing papers.
- p.3, 34-35: “economical classifications of companies”. Rather, ‘economic’?
- p.4, 11: “public organizations, and criterion” Rather, ‘criteria’?
- p.15, 7-8: “that covers jurisdictions NO, GB, BG and statuses from data providers OpenCorporate, and SpazioDati and also from LEI.” Probably, ‘the jurisdictions’, and some connectives and commas should be fixed in this sentence, too, afaik.
- p.36, 30-31: “scholl” School?
Summarizing the evaluation along the standard dimensions for SWJ onto descriptions:
(1) Quality and relevance of the described ontology: solid artifact (with just a couple of likely, relatively minor, flaws), though no major research challenge addressed.
(2) Illustration, clarity and readability of the describing paper: solid, but contains additional parts that do not fit well within the scope of an onto description.
For me, the overall recommendation is obvious: major revision, consisting in removing the low-relevance parts (replacing the long ‘use case’ summaries with very short descriptions, plus possibly a table) and fixing most of the minor-to-medium severity issues in the remaining text (and, if adequate, in the ontology, too, in a few cases).