Review Comment:
First of all, I would like to thank the author for the effort put into addressing the concerns that arose in the previous round. Some of the answers have resolved some of the raised concerns, but I am afraid that the contribution of the current paper is still not mature enough to be accepted:
- When I referred to the definition of the prediction task, I was referring to the actual inputs used in each case and to the actual effect and boundaries of the prediction. As I understand it, the current approach takes the set of averaged features gathered in the latest (this is important) window and uses them to predict only whether the current user is above or below the median of the performance metric at hand. In that case, what is the actual median performance of the users? Given that they are not experts and are evaluating the validity and completeness of a set of ontology mappings, it would be interesting to know how well they actually performed.
- Moreover, regarding the prediction task and setup, given that performance is the value being predicted, it is not surprising that the cumulative gaze information (that is, the window with the most information about the whole process) performs best, and that digests and snapshots, which use only locally temporal information, behave worse. This choice of prediction task (along with the fact that it had already been demonstrated in [39] with deep learning methods) jeopardizes the current contribution: what about predicting the outcome of the next subtask? The actionability of merely predicting every two minutes whether a user will be performant in a global sense is quite limited, and, in fact, the author raises it as a potential limitation for short tasks. I understand that users' behaviour can differ considerably: for example, first understanding both models and then establishing the validity of the mappings all at once, versus validating one mapping at a time. I consider that at least some analysis of this should be done to increase the value of the contribution.
- Regarding the user characterization, the author states in the answer letter: "the results would not be very generalizable if a study were to require all participants to have an intricate technical background in SW technologies to complete, as such an expert user sample is not representative of the real-world users of SW technologies". In fact, it is quite the opposite: no regular user is expected to view an ontology directly (let alone establish mappings) or to work with Semantic Web technologies. These technologies have a steep learning curve (e.g., SPARQL is far from trivial to use, even for people already trained in SQL).
Thus, given that the focus is on the analysis of the tasks themselves, rather than on proposing an adaptation method for ontology visualization, I am afraid that, in its current state, I would advocate for a major revision including at least a finer-grained analysis of the users' tasks and, if possible (not mandatory), an analysis with users who can be considered experts.
Minor comments:
- In Section 7, the term "task" is used ambiguously. It refers to the prediction task, but please consider using a different term to distinguish it from the actual task the user is performing (i.e., correctness is not an actual task).