Review Comment:
Summary of submission:
InterpretME is an analytical tool that traces the behaviour of predictive models built over data collected from KGs, in order to support their explainability.
Using SHACL, the tool implements a set of integrity constraints that provide a meaningful description of a target entity of a prediction model.
Here it is unclear what 'target entity' refers to. It might be clearer to describe the type of target entity: is it a hyperparameter, or a feature? It seems to be an instance from the training, validation, or test set.
The tool is targeted at predictive modelling (forecasting future outcomes based on past data) with data from KGs.
The tool's focus is automation assistance: InterpretME captures metadata from the input KG (features, target classes) and the model, and records SHACL constraints for data validation. InterpretME traces the optimised hyperparameters and the estimated relevancy of features, and records the model's performance metrics (e.g., precision) for a particular run. Moreover, SHACL validation reports are stored. Tracing the metadata collected from input KGs helps to provide explanations for the predictions made by the predictive models.
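As I understand it, the traced metadata amounts to RDF triples that tie a run's configuration and outcomes back to the input KG. A purely illustrative sketch of what one run might look like (all IRIs and property names below are my invention, not InterpretME's actual vocabulary):

```turtle
@prefix intr: <http://example.org/interpretme/> .

# Invented vocabulary, for illustration only.
intr:run42 a intr:PipelineRun ;
    intr:hasTargetClass intr:ALK_Positive ;
    intr:hasRelevantFeature intr:smoking, intr:age ;
    intr:hyperparameter [ intr:name "max_depth" ; intr:value 5 ] ;
    intr:precision 0.83 ;
    intr:hasValidationReport intr:report42 .
```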
**This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions:**
**(1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided).**
The authors distinguish three types of automation: mechanisation (algorithms that can run by themselves), composition (in which sequences of tasks can be performed), and assistance (where a user is assisted in interpreting algorithms and their output). They argue that the least work has been done on automated assistance, and they see great potential in knowledge graphs for aiding with it. —> *I really like this idea of using semantic web technologies for easy integration between an input dataset and the metadata produced in a machine learning pipeline for better interpretability, although the impact could be made clearer in my opinion: how is interpretability aided, and what is the added benefit of doing this with KGs, possibly with some references.*
In a use case in which cancer patient features are used to predict lung cancer biomarkers, they report *five questions* that oncologists would still have after using well-known tools for model interpretability (e.g., LIME and SHAP), and that should be answerable with their InterpretME KG. —> *Where do these five questions come from? Do they come from real oncologists? Do these five questions cover the interests of domain experts after performing a predictive task on their KG data?*
Section 4.1: I am not entirely sure I agree that a higher node degree necessarily means that an entity is more *human-interpretable*, if there is no example to illustrate how that extra node information helps an oncologist or other domain expert interpret the results. Maybe add some examples of the information that is added for nodes and what an oncologist or other expert can learn from that additional information.
**(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool. Please also assess the data file provided by the authors under “Long-term stable URL for resources”.**
The paper does not contain many spelling errors, but the text could use some grammar checks: some sentences miss an article here and there. Some sentences are also a bit unclear and could be elaborated on.
The motivating example makes clear what common methods miss. It is, however, not so clearly described what the limitations and requirements of the tool are. Can you use it with any given KG, or do the KG and the training examples need to have a certain shape? What is the benefit of using this semantic technique over non-semantic ones, and specifically, what is the benefit of mapping the data collected from the predictive models to RDF using RML? Does the user need to write their own SHACL shape constraints, and are there recommendations for doing so? It is unclear from Figure 4b what these constraints encode; maybe a more illuminating example would help (see the sketch below). The motivating example mentions the ‘great potential of integrating knowledge graphs with predictive modelling frameworks’ —> some references could be added here for clarification.
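To make the RML question concrete, this is the level of example I would have liked to see (the file name, template, and property names below are my own invention, not the authors' actual mappings): a minimal mapping of a per-run metrics CSV to RDF, which is where the semantic benefit, i.e., metrics becoming queryable together with the input KG, would show up.

```turtle
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.org/> .

# Hypothetical mapping of a per-run metrics CSV to RDF.
ex:MetricsMapping a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "model_metrics.csv" ;
        rml:referenceFormulation ql:CSV
    ] ;
    rr:subjectMap [ rr:template "http://example.org/run/{run_id}" ] ;
    rr:predicateObjectMap [
        rr:predicate ex:precision ;
        rr:objectMap [ rml:reference "precision" ]
    ] .
```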
What is ‘the target entity’ referred to in the motivating example and also later in the text? Are these entities in the test dataset?
The three forms of automation from De Bie et al. could be explained a bit better; the three short sentences are a bit unclear (I had to look them up in the original paper to understand them).
In Section 3.2 and Figure 3, as well as in the running example, it is at times unclear which parts are done automatically by InterpretME, and which steps are facilitated by InterpretME but remain a task for the user.
**4** empirical evaluation: ‘Each of the SHACL constraints validates a person’ —> could you give an example of such a SHACL constraint here? That would make this easier to understand.
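For instance, a shape along these lines (the class and property names are invented; I do not know the authors' actual constraints) would make the validation step concrete:

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical constraint validating a person/patient entity.
ex:PatientShape a sh:NodeShape ;
    sh:targetClass ex:Patient ;
    sh:property [
        sh:path ex:age ;
        sh:datatype xsd:integer ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .
```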
**4.1** —> it is unclear why degree distribution is a heading here; how does it relate to any of the RQs? It would help if there were a clearer mapping between the RQs and the headings that follow.
*WithInterpretME* —> I assume that this refers to the degree distribution over the input KG plus the InterpretME KG, but this is not defined anywhere.
‘The execution of queries 1 and 4’ —> what do these queries query for?
**4.2** —> ‘in terms of 20’: in terms of 20 what?
Some minor things:
Line 49 ‘entities of the target classes, e.g., HasSpouse’ —> is this not a relation?
The text in Figure 4 is very small. Also, should *og:1501042* be *eg:1501042*? The part on entity alignment (‘entity alignment is performed to trace original entity of input KG with SHACL validation results and predictive modeling pipeline’) is unclear to me: why is entity alignment necessary, and how is it done? The constraints could also be more clearly described.
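If the alignment simply establishes an identity link between the original input-KG IRI and the IRI minted in the pipeline, e.g. something like the following (the IRIs are invented), then stating that explicitly would resolve my confusion:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix ex:   <http://example.org/inputKG/> .
@prefix intr: <http://example.org/interpretme/> .

# Hypothetical identity link between the original and traced entity.
ex:patient123 owl:sameAs intr:patient123 .
```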
**(A) assess whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete.**
URL for resources: https://github.com/SDM-TIB/InterpretME: the GitHub repository contains an elaborate README.md and some worked-out examples that can be run on the fly, making it easy for the user to replicate the experiments mentioned in the paper. The files are well organised. A brief description of what the example queries do would be clarifying.