Review Comment:
SUMMARY:
This article proposes an ontology-based approach to account for training, implementation and performance details of statistical machine learning models in order to increase their accountability. It re-uses pre-existing ontologies to describe ML models and to describe obtained predictions, which are evaluated in a use case on energy efficiency.
OVERALL EVALUATION:
The idea to provide more detailed metadata on machine learning models/procedures is highly interesting. However, the proposed approach seems to consider a very limited range of models/procedures, mostly applicable to statistical ML models, especially when requiring an implementation in R. One basic ontology reused for accounting on models' predictions is derived from the domain of energy efficiency, but it is claimed that it is still generally applicable and easy to re-use. However, the chosen use case is then in the domain of energy efficiency, which fails to support this claim. Furthermore, the three people chosen for the evaluation are themselves developers/system managers, which says little about instilling trust in AI systems by general users. It is stated that the system had been evaluated with 120 units, however, no details of this evaluation are provided. In a nutshell, with a different use case and a rigorous evaluation procedure with end users this might be an interesting approach, given that the tool is described in more details and made publicly available.
SPECIFIC COMMENTS:
Introduction:
Accountability (page 2, par 2) is first defined based on a source that addresses explainability and in fact the definition fits the latter better than the first. Please delimit these two concepts more clearly and explicitly. Please provide first a definition of explainability and then explicitly a definition of accountability including distinctions between these two. As it is now, this distinction is not clearly marked. A stronger motivation on your proposed work, including explicit contributions, would probably increase the readability of this article.
Related Work:
Also the related work section suffers from a lack of clearly delimiting the concepts explainability, accountability, and trustworthiness. In this paper, it seems that you address accountability only, thus raising the question on why all these explainability approaches can be considered related work.
FIDES:
FIDES seems to be limited to models developed in R, however, the vast majority of available neural models (and traditional ML algorithms) is developed in Python or Java, Ruby, etc. Could you please explain this choice further and justify it?
Ontology:
The question "Which is the frequency of a given predictive model’s training data?" is very unclear - training data have no frequency per se. Do you mean frequency of individual items in the dataset? Also "Which is amount of observations used for training a given predictive model?" is not very clear - do you mean the size of the training dataset? The question "When was the last data point within a given predictive model’s training data collected?" presumes that training data are dynamically collected, which is not always the case. I propose splitting the questions further regarding the actual training procedure and the datsets involved in training a model and then structuring the compentency questions accordingly with different options for different types of training procedures/dataset collection procedures. For the question "Which is the base algorithm of the predictive model?" it would in many cases be unclear to me what to answer here, if it is not a traditional statistical machine learning procedure, but a deep learning architecture. And what if RMSE had not been used as a loss function?
Overall, it should be stated who answers these questions and when/for what reason. Is it developers who seek to utilize FIDES who answer these questions? Or would they be semi-automatically derived from the model, e.g. hyperparameter settings, datasets, loss function, etc. are generally available as explicit information.
While the EEPSA ontology might be a valid choice, even though the authors may be biased here, some of the elements, especially "Quality", should be specified more clearly (also the current online definitions do not help much here). What is the gap between the OntoDM core and the DMOP ontology to be addressed by the ML-Schema? How about improving the ML-issues addressed for FIDES instead of asking the ML community to do so?
Especially with the mapping of the two ontologies, it is rather unclear to me where all the metadata highlighted as important in the introduction and competency questions are represented or can be modeled for a specific implemented model, e.g. authorship, responsibilities, etc. The mapping also raises the question of what exactly can be gained by mapping these two ontologies. While the examples in the next section provide some ideas on this, it should be explicitly stated in this section.
The actual implementation of FIDES as a tool/system should be described in more details beyond the statement that there is a GUI and the specific pre-existing services are re-used for it. Is it publicly available?
FIDES in use:
To account for the actual generalizability of utilizing an "Energy Efficiency Prediction Semantic Assistant" ontology for general ML accountability purposes, it would be necessary to chose a use case that is not in the energy efficiency domain.
Why is the overall energy efficiency solution out of scope for this paper? How does FIDES account for privacy issues? While in some countries it might be legal to access such private information as energy consumption of your neighbors, in others it is not. How are privacy concerns of data addressed in this approach?
Evaluation:
The queries to the datasets are insightful and interesting, however, for a true evaluation in how far this approach increases trust of users, it would be interesting to provide an evaluation with the users in this use case and their view in how far such a system helps to instill trust in AI. Having two of the data scientists involved in training the systems and a system manager involved in the evaluation fails to address this critical point. The selection of participants of course also strongly affects the validity of the evaluation. A separation of evaluation and discussion would allow for a proper discussion of the proposed approach, which is currently intertwined with an evaluation that could benefit from improvements.
MINOR COMMENTS:
Please check SWJ guidelines on how to format your manuscript, e.g. how to refer to figures
Also a consistent and correct use of quotation marks would improve the manuscript.
p1.35 can be overcame => overcome
p1.43 including the explainability => including explainability OR the explainability of AI systems.
p1.Footnote => it extends into the text of the second column, please fix this
p2.3 The explainability => Explainability
p2.11 The accountability => Accountability
p2.34 as it would be needed to be an expert => as the person would have to be an expert
p2.36 the regular performance these accountancy tasks would be infeasible. => ??
p2.47 Since the AI is a field => Since AI is a field
p2.50 the machine learning (ML) => machine learning (ML)
=> please check the use of articles in the entire manuscript, too many to mention all here
p2.38 generation of ... into their AI-enabled systems => for?
p3.19 may act an effective way => may act as?
p3.35 To the extent of knowledge of author => the authors and shouldn't you check on what happened since 2020 in this regard?
p4.33 knowable topic => ??
p4 ff I recommend introducing acronyms, e.g. RMSE, SOSA, EEPSA, etc.
p5.32 may derive in => may result in
p8.42 Figure 5 => Fig. 5. (full-stop missing)
p12.21 would require from further functionalities => ??
p13.37 clicking in the ’Algorithm’ button => on
The overall language quality of the manuscript requires thorough revision.
|