Experts vs. Automata: A Comparative Study of Methods for a Priori Prediction of MCQ Difficulty

Tracking #: 1976-3189

Ghader Kurdi
Jared Leo
Nicolas Matentzoglu
Bijan Parsia
Uli Sattler
Sophie Forge
Gina Donato
Will Dowling

Responsible editor: 
Lora Aroyo

Submission type: 
Full Paper
Successful exams require a balance of easy, medium, and difficult questions. Question difficulty is generally either estimated by an expert or determined after an exam is taken. The latter is useless for new questions and the former is expensive. Additionally, it is not known whether expert prediction is indeed a good proxy for difficulty. In this paper, we compare two ontology-based measures for difficulty prediction with each other and with expert prediction (by 15 experts) against exam performance (of 12 residents) over a corpus of 231 medical case-based questions. We found one measure (relation strength indicativeness) to be of comparable performance (accuracy = 47%) to the experts (average accuracy = 49%).
Major Revision

Solicited Reviews:
Review #1
By Vinu Ellampallil Venugopal submitted on 29/Jan/2019
Major Revision
Review Comment:

In general, this is a well written paper. However, I have concerns regarding the focus of the paper with respect to the audience of this journal. The paper feels like an extended version of two of the authors' previous works [6,7]; mainly the experimental section was extended. There is no major contribution on the conceptual side. However, this paper, especially due to its detailed experimental study, would definitely be useful for communities working in areas related to ontology-based question generation and automated educational assessment generation.

*Main concern*
The title of the paper itself says "Experts Vs Automata". Why is this study so important, especially when previous studies in both the educational and psychological science communities have shown that experts' prediction of difficulty level is not so precise?

*Detailed comments*

This paper compares two existing ontology-based measures for predicting difficulty level against experts' predictions. I wonder whether these were the only possible measures. Even though the authors mention machine-learning-based methods, a comparison with such methods would have given this manuscript a better reach. However, I understand that this may not be possible if a domain-specific dataset is not available.

The authors should clearly state that this work is an extension of their previous work [6,7] and rewrite the new or main contributions accordingly. Also, I feel that paper [7] is cited too many times in the current manuscript. At one point, I felt that there is too much overlap between these two papers.

RQ1 and RQ2 appear ambiguous in their use of *compared to* -- please rewrite!

"We found that MCQ difficulty was moderately predicted by domain experts" -- This result has previously been well studied and reported in earlier works (mainly in the educational community). The authors should cite such works and may instead report their observations for the current domain under consideration.

In the main contributions:
1. It is mentioned that new ontology-based measures are proposed. However, I see that the measures were already published in the authors' previous works.

2. A fairly large question set -- is it available online?

3. The experimental setup that should be followed for evaluating ontology-based metrics would be a main contribution.

"More details about each type of question and..." --- Please add a reference. As I mentioned above, please avoid too many references to previous work; it really affects the flow and readability of the paper. Make it more self-contained.

*4.1.3. Procedure*
"To check agreement among experts, questions were reviewed by two experts whenever possible" -- What was the criterion for a tie-break?

*4.2.1. Subject*

Are there any assumptions regarding the level of expertise of these residents?

What is the significance of their "practical experience" for assigning their expertise levels, say, as high, medium, or low?

Please detail the reasons for choosing the mentioned demographic characteristics of the residents.

This detail is important because, based on the definitions of the various difficulty-level classes in Section 4.1.3, there is an explicit assumption that the expertise levels of the subjects under consideration should follow a uniform distribution.

Refer to Section 3 of [1] for a detailed definition of the difficulty levels based on educational theories.


*5.2.1. Is the expert prediction a good proxy for difficulty?*
"the data suggest that experts were more accurate in their prediction when they answered the questions correctly" -- From the perspective of forming a best practice for collecting expert observations, this is a good observation, even though it is fairly obvious. But while detailing how experts' predictions should be collected, make sure that the focus of the paper does not drift. As this is a Semantic Web journal, readers would be more interested in the comparison of predicted difficulty levels with the actual levels derived from students' performance than with experts' predictions.

Please elaborate on the reasons why expert prediction is considered a major component of the evaluation framework. I think this is an important point that determines the relevance of the paper.

Review #2
Anonymous submitted on 13/Apr/2019
Major Revision
Review Comment:

The topic of the paper is quite interesting and relevant for the journal. Automatic creation of exams, especially using structured domain knowledge, could significantly improve the reliability and quality of exams across different domains.

The paper is well written and organized.

The main comments I have are related to the experimental procedure, which requires significant in-depth clarification:
- reference [7] is referred to continuously in the paper; however, that paper cannot be found online. It is inappropriate to cite papers that are under review and not accepted at the time of submission. This is a major drawback, as critical explanations needed to understand the methodology are supposedly given there. It is also not clear how the current study and the study in [7] differ.
- when the subjects are described in the experimental section, it is also important to know how they were recruited and what the payment was. The demographics alone are not sufficient to judge the appropriateness of both the experts and the residents.
- it is interesting to observe that, out of 15 experts, 10 had no experience, or less than a year of experience, in constructing exams. Experience with exam construction is, in my opinion, crucial in this kind of study, as judgements on appropriateness, difficulty, etc. can vary significantly with experience.
- "out of 435 questions, 375 questions were rated as appropriate by at least one reviewer" -- here it is important to explain what was wrong with the other 60 questions: why were they judged inappropriate, and was a specific scale used, or just a binary answer? Additionally, the 375 questions had at least one "appropriate" rating -- what happened when the reviewers disagreed? Were these questions examined, and why did the experts disagree? Why did you decide to keep a question if one of the two reviewers disagreed, rather than simply omitting it?
- "Questions were reviewed by two experts whenever possible" -- please explain what "whenever possible" means. Provide more details on how many questions were reviewed by how many reviewers. See the comment above: explain also how you dealt with disagreement and with questions reviewed by only one expert. How reliable can this be? If orthopedics had only one expert, shouldn't you simply have excluded this domain from the experiment?
- It seems that there is an imbalance between the domain expertise of the experts and that of the residents. Many of the residents had expertise not covered by most of the experts. How does this affect the reliability of the experimental design and the results?
- An explanation was provided to experts only if they got the answer wrong. However, experts were asked to judge the explanations of the questions. This might bias which explanations the experts saw and the overall judgement of reliability.
- How were the different aspects of the questionnaire rated by the experts? Was there a scale, and what was the agreement/disagreement among the experts?

With respect to the research approach, I wonder how transferable it is to other domains. The authors have published on this topic only in the medical field. For this research to have a more general impact, it is important to study its transferability to other domains, as well as the impact on difficulty estimation when domains are not well structured or when multiple taxonomies exist. As such, Sections 3, 6 and 7 need substantial clarification on these points.