Review Comment:
In general, this is a well-written paper. However, I have concerns regarding the focus of the paper with respect to the audience of this journal. The paper feels like an extended version of two of the authors' previous works [6-7]; mainly the experimental section was extended. There is no major contribution on the conceptual side. However, this paper, especially due to its detailed experimental study, would definitely be useful for communities working in areas related to ontology-based question generation and automated educational assessment generation.
*Main concern*
The title of the paper itself says "Experts vs. Automata". Why is this study so important, especially when previous studies in both the educational and psychological science communities have shown that experts' prediction of difficulty level is not very precise?
*Detailed comments*
This paper compares two existing ontology-based measures for predicting difficulty level against experts' predictions. I wonder if these were the only possible measures. Even though the authors mention machine-learning-based methods, a comparison with such methods would have given this manuscript better reach. However, I understand that this may not be possible if a domain-specific dataset is not available.
The authors should clearly state that this work is an extension of their previous works [6-7] and rewrite the new or main contributions accordingly. Also, I feel that Paper 7 is cited too many times in the current manuscript; at one point, I felt that there was too much overlap between the two papers.
*Introduction*
RQ1 and RQ2 appear ambiguous in their use of *compared to* -- please rewrite!
"We found that MCQ difficulty was moderately predicted by domain experts" -- This result has previously been well studied and reported in earlier works (mainly in the educational community). The authors should cite such works instead, and perhaps report their observations for the current domain under consideration.
In the main contributions:
1. It is mentioned that new ontology-based measures are proposed. However, I see that these measures were already published in the authors' previous works.
2. A fairly large question set -- is it available online?
3. The experimental setup to be followed for evaluating ontology-based metrics would be a main contribution.
*Methods*
"More details about each type of question and..." --- Please add a reference. As mentioned above, please avoid too many references to previous work; it really affects the flow and readability of the paper. Make the manuscript more self-contained.
*4.1.3. Procedure*
To check agreement among experts, questions were reviewed by two experts whenever possible -- What would be the criteria for a tie-break?
*4.2.1. Subject*
Are there any assumptions regarding the level of expertise of these residents?
What is the significance of their "practical experience" for assigning their expertise levels, say, as high, medium, or low?
Please detail the reasons for choosing the mentioned demographic characteristics of the residents.
This detail is important because, based on the definitions of the various difficulty-level classes mentioned in Section 4.1.3, there is an explicit assumption that the expertise levels of the subjects under consideration should follow a uniform distribution.
Refer to Section 3 of [1] for a detailed definition of the difficulty levels based on educational theories.
[1] http://www.semantic-web-journal.net/system/files/swj1898.pdf
(http://www.semantic-web-journal.net/content/difficulty-level-modeling-on...)
*5.2.1. Is the expert prediction a good proxy for difficulty?*
"the data suggest that experts were more accurate in their prediction when they answered the questions correctly" -- From the perspective of forming a "best practice" for collecting expert observations, this is a good observation, even though it is rather obvious. However, while detailing how experts' predictions should be collected, make sure that the focus of the paper does not drift. This being a Semantic Web related journal, readers would be more interested in the comparison of predicted difficulty levels with the actual levels derived from students' performance than with experts' predictions.
Please elaborate on the reasons why expert prediction is considered a major component of the evaluation framework. I think this is an important point which determines the relevance of the paper.