Review Comment:
This ("ontology") paper describes "ML Schema", an upper-level vocabulary for representing machine learning experiments. The schema (specification from 2016 available at http://htmlpreview.github.io/?https://github.com/ML-Schema/documentation...) is the outcome of a community effort (W3C ML Schema Community Group) and is presented in this paper.
Positive points:
- Paper is mostly easy to read and follow.
- Timely topic of relevance for the SWJ community.
- Some initial tool support is available (e.g. export from OpenML).
Negative points:
- The schema appears to still lack maturity in parts (e.g. when it comes to clear definitions of concepts and a sound distinction between classes/subclasses and instances, i.e. TBox vs. ABox).
- It seems unclear if the schema is generic enough to capture all kinds of ML equally well (details below).
- Adoption is unclear: given that the schema has been out there for three years and was a long time in the making, I was disappointed not to see any proof of actual adoption, e.g. data about real-world ML experiments being publicly shared using ML Schema.
- Presentation is not ideal and raises questions (details below).
- No actual instances (or KBs based on the schema) are provided, and it is unclear what kind of inference the schema is meant to support (if any) and how it actually supports interoperability. Here, some real-world use cases/data and examples of how it facilitates certain competency questions would be useful.
Detailed comments:
The topic is timely, and the presented approach of offering a top-level ontology able to link/bridge between different related vocabularies seems reasonable. The paper is easy to read, and some tooling is mentioned, e.g. a tool to export OpenML data following ML Schema.
While the paper is reasonably well written, the structure and the lack of a general overview make certain parts hard to follow. For instance, the set of properties in Section 2.2 is hard to understand and assess without a more general overview of the entire model and its classes first. Questions about the distinction between the schema and the instance level arise already here: e.g. hasHyperParameters is defined as a relation between an "implementation" (of an ML "algorithm") and its hyperparameter. First of all, the relation should be named "hasHyperParameter" (as it is instantiated for a single hyperparameter). Second, the distinction between "model" and "implementation" at the schema and instance level seems blurred. A model supposedly is the instantiation of a particular implementation (e.g. a trained random forest "model" for a particular task, using the Weka implementation). Here, an "implementation" (and/or an "algorithm") has hyperparameters ("number of trees"), but the model itself would be associated with instances of those hyperparameters (e.g. "number of trees = 20"). How is this intended, and what is the modeling approach behind it? A sketch of the distinction I have in mind follows below.
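To make this concrete, here is a minimal Turtle sketch of the reading I would expect, separating the schema level (a hyperparameter declared on an implementation) from the instance level (a concrete setting used in one run). I reuse mls: terms where the spec seems to provide them; the ex:HyperParameterSetting class and the ex:onHyperParameter/ex:hasValue/ex:usesSetting properties are hypothetical placeholders, since it is exactly this part that remains unclear to me:

    @prefix mls: <http://www.w3.org/ns/mls#> .
    @prefix ex:  <http://example.org/> .

    # Schema level: the implementation declares which hyperparameters it has.
    ex:WekaRandomForest a mls:Implementation ;
        mls:hasHyperParameter ex:numTrees .        # "number of trees"

    # Instance level: a concrete setting binds a value for one particular run.
    ex:numTrees20 a ex:HyperParameterSetting ;     # hypothetical class
        ex:onHyperParameter ex:numTrees ;          # hypothetical property
        ex:hasValue 20 .

    ex:run1 a mls:Run ;
        ex:usesSetting ex:numTrees20 ;             # hypothetical property
        mls:hasOutput ex:trainedForestModel .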
Typically, different "configurations" of models are tested, e.g. the same SVM trained on the same data but with different hyperparameters, or a model for the same task but using different variations of the training data (e.g. balanced vs. unbalanced). Wouldn't it make sense to introduce the notion of a configuration here (see the sketch below)? Also, such matters differ considerably depending on the task type (supervised/unsupervised), and I don't see how these differences are accounted for by the schema.
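As a purely hypothetical illustration (none of the ex: terms below exist in ML Schema), a configuration could group everything that varies between otherwise identical experiments:

    @prefix ex: <http://example.org/> .

    # Two configurations of the same SVM implementation: same task,
    # different hyperparameter settings and training-data variants.
    ex:config1 a ex:Configuration ;
        ex:usesSetting ex:C10, ex:gamma01 ;
        ex:onTrainingData ex:irisBalanced .

    ex:config2 a ex:Configuration ;
        ex:usesSetting ex:C100, ex:gamma01 ;
        ex:onTrainingData ex:irisUnbalanced .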
A minor comment in this regard: "hasOutput" may also be confusing, given that here "output" refers to the model itself, whereas in traditional neural network settings one would use "output" to refer to the prediction output of a model.
Also, what do you mean by "entities" in the description of the "hasQuality" property, and by "information content entity" in the "specifiedBy" property? That also seems rather unclear.
A similar problem arises with some of the class descriptions in the paper. Looking at Table 1 ("Task"), it lists different ML task types (classification, clustering etc.) as "example classes". These would be *subclasses*, I suppose? You then describe an example instance ("Classification on Dataset Iris"), but at the same time you refer to the OpenML "TaskType", where, I suppose, the instances are the task types themselves (e.g. "classification", "clustering") and not the actual tasks ("classification of X"). In general, it is unclear in the tables what is meant by "relation with aligned ontologies" (what kind of "relation", equivalence?).
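The two readings are not interchangeable; in Turtle (mls: terms from the spec, ex: identifiers made up):

    @prefix mls:  <http://www.w3.org/ns/mls#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/> .

    # Reading 1: task types are subclasses (TBox), concrete tasks are instances.
    ex:ClassificationTask rdfs:subClassOf mls:Task .
    ex:classificationOnIris a ex:ClassificationTask .

    # Reading 2 (the OpenML "TaskType" style): the task type itself is an
    # individual, so "classification" becomes ABox data rather than a class:
    # ex:classification a ex:TaskType .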
The same problem is apparent with the "EvaluationMeasure" class (Table 8): the "subclass" examples are "ClassificationMeasure", "RuntimeMeasure" etc., but the individuals are described as "RMSE" etc. Wouldn't RMSE just be another subclass (a specific type of measure), and the instance would be "RMSE = 0.6"? The authors should be clearer here, better define the schema, and also illustrate its use with examples. At the moment, one can only assume that it is entirely thought through, and actually instantiating the model will raise a number of questions.
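Again in Turtle (ex:hasValue and ex:RegressionMeasure are hypothetical; the second reading is commented out, as the two contradict each other):

    @prefix mls:  <http://www.w3.org/ns/mls#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/> .

    # Reading 1 (as Table 8 suggests): the measure type is an individual.
    ex:RMSE a mls:EvaluationMeasure .

    # Reading 2 (what I would expect): the measure type is a subclass, and
    # the individual is one concrete measurement from one evaluation:
    # ex:RMSE rdfs:subClassOf ex:RegressionMeasure .
    # ex:evaluation1_rmse a ex:RMSE ;
    #     ex:hasValue 0.6 .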
When describing the "Data" class, you define it as "a data item composed of data examples". What do you mean by data examples? Instances? What would be properties here? Aren't there vocabularies such as DCAT which could be used here in addition? IMO, describing the dataset in a reproducible way is a huge challenge, but one could refer to VoID or DCAT and the like and make sure that a URL/identifier is provided from which the data can be obtained (see the sketch below). This would be one of the most crucial properties and contributions for ensuring that ML models are actually reproducible and understandable.
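For illustration, a DCAT description along these lines (the dcat:/dcterms: terms exist in those W3C vocabularies; the identifiers and URL are made up) would let anyone retrieve the exact data used in an experiment:

    @prefix dcat:    <http://www.w3.org/ns/dcat#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix ex:      <http://example.org/> .

    # A versioned, resolvable description of the dataset an experiment used.
    ex:irisDataset a dcat:Dataset ;
        dcterms:identifier "iris-v1" ;
        dcat:distribution [
            a dcat:Distribution ;
            dcat:downloadURL <https://example.org/data/iris-v1.csv> ;
            dcat:mediaType "text/csv"
        ] .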
The "model" class states: "we define Model as a generalisation of a set of training data able to predict values for unseen instances". This is (a) unclear (what do you mean with a "generalisation of a set of training data") and (b) would cover certain task types (eg classification/regression) but not others (clustering) which are unsupervised. This has reinforced my overall impression that the schema is not as generic as it intends to be and may not cover ML in all its diversity. Wrt classification/regression, the model should be the output of a particular "run", which in turn is a "run" of a "data"/"implementation" combination.
Similar doubts apply to the "Run" class. A "run" of a clustering "implementation" (say, k-means) is a very different case from a "run" of a classification "implementation": one spits out clusters (i.e. the direct outputs), the other spits out a model from which to generate outputs (e.g. labels/classes). The schema does not seem to cater for this kind of diversity (see the sketch below). Also, and this seems even more crucial, it is not clear whether a "run" here reflects the training stage (of supervised models) or the test/classification stage.
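To spell out the difference (only the mls: terms come from the spec; ex:stage and the ex: individuals are hypothetical):

    @prefix mls: <http://www.w3.org/ns/mls#> .
    @prefix ex:  <http://example.org/> .

    # Two "runs" with structurally different outputs; the schema should make
    # this difference, and the training vs. test stage, explicit.
    ex:kmeansRun a mls:Run ;
        mls:hasOutput ex:clusterAssignments .   # direct output, no model

    ex:svmTrainingRun a mls:Run ;
        ex:stage ex:Training ;                  # hypothetical stage property
        mls:hasOutput ex:trainedSvmModel .      # a model, not predictions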
Regarding Table 9 ("Study"), it is also not clear what is meant by "a collection of runs".
In summary, while I believe this is a worthwhile effort, the paper (and schema) requires more clarity, important questions regarding the instantiation of the model should be addressed, and examples of use should be provided to illustrate its impact and demonstrate that the schema actually adds value to the problem of understanding, finding, and interpreting ML experiments. At the moment, this is not supported by the paper, even though the latest schema spec has been out there since 2016.
I do hope you'll find this feedback useful.
Stefan
More minor comments:
- Figure 1 is not very clear and not very well described.
- Beginning of Section 3.4: "a a prior...." (duplicated word).