Review Comment:
(Our overall impression is based on UK scaling, i.e., about a Merit for an MSc (appropriately adjusted).)
(This review is the result of a collaboration between a senior academic and a PhD student working in the area.)
(Note that the review is written in light LaTeX and should compile standalone except for the bibliography:
[1] T. Alsubait, B. Parsia, and U. Sattler. Generating multiple choice questions from ontologies: Lessons learnt. In OWLED, pages 73–84, 2014.
[2] G. T. Brown and H. H. Abdulnabi. Evaluating the quality of higher education instructor-constructed multiple-choice tests: Impact on student grades. In Frontiers in Education, volume 2, page 24. Frontiers, 2017.
[3] N. Karamanis, L. A. Ha, and R. Mitkov. Generating multiple-choice test items from medical text: A pilot study. In Proceedings of the Fourth International Natural Language Generation Conference, pages 111–113. Association for Computational Linguistics, 2006.
[4] J. D. Kibble and T. Johnson. Are faculty predictions or item taxonomies useful for estimating the outcome of multiple-choice examinations? Advances in Physiology Education, 35(4):396–401, 2011.)
\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
The paper presents an ontology-based approach for predicting the difficulty of short-answer factual questions, taking into account the knowledge level of learners. Four features were proposed and the corresponding ontology-based measures were defined. A prediction model that relies on the proposed features was developed, and an evaluation of the model is reported.
The major contribution of the paper is the definition of a new set of features that can be used for predicting the difficulty of short-answer factual questions. Taking learners' knowledge level into account when predicting difficulty is another distinguishing feature of the work presented. This is especially important in adaptive learning systems, where materials need to be adapted to learner levels. The prediction methodology seems feasible. With regard to presentation, the paper is well organised and easy to follow.
On the downside, the evaluation methodology and the analysis of results are not reported in sufficient detail. There are also aspects of the data that were not considered in developing and evaluating the prediction model, such as the distribution of difficulty levels; I explain these below. I could not interpret the results based on the reported information. Therefore, I have concerns about the implementation of the prediction model and the claims made about difficulty prediction. In particular, I believe that the claims ``The performance of the models based on cross-validation is found to be satisfactory'' and ``Comparison with the state-of-the-art method shows an improvement of 8.5\% in correctly predicting the difficulty-levels of benchmark questions'' need additional support.
\paragraph{Recommendation} Major correction. I believe that, at least, deeper analysis of the data is required and the evaluation sections need to be rewritten. Collecting additional data might also be needed.
\section{Major remarks}
\begin{enumerate}
\item I assume that the aim is to approximate difficulty as indicated by student performance (Rasch difficulty). However, the authors seem to imply that expert prediction is an accurate proxy for student performance, which has been questioned in several studies (for example, see \cite{kibble2011faculty}). This is apparent from the training data, where observations about student performance and expert predictions are mixed together in order to increase the sample size. In addition, the automatic prediction was compared with domain-expert prediction, as indicated by: ``the predicted difficulty-levels of questions (chosen from four domains) were found to be close to their actual difficulty-levels determined by domain experts''. The target difficulty (student performance, expert prediction, or both) needs to be stated clearly, and the training data and the evaluation need to be aligned with the stated goal(s); for reference, the Rasch formulation I have in mind is written out after this list. If the goal is to predict student performance, expert prediction and student performance should not be mixed together without justification. Minimally, the agreement between them needs to be checked on a subset of the data. Mixing both difficulty metrics together seems plausible in cases where there is large agreement; however, this needs more thought about, and discussion of, the implications.
\item Section 6, training data paragraph: A sample of 520 questions was selected for training. However, relevant practical information has not been reported. This includes:
\begin{itemize}
\item Why 520 questions, and is this enough training data?
\item How were the questions selected (e.g., random sample, stratified sample)?
\item What is the distribution of difficulty in the training sample?
\item What is the distribution of proficiency levels in the training sample?
\item Does the training set contain enough questions that capture all difficulty and learner levels?
\item How many observations about student performance and how many observations about expert prediction are there in the training set?
\end{itemize}
\item Section 6, training data paragraph: The authors mentioned that difficulty was obtained in a classroom setting by using IRT, but did not mention how the students were recruited or how many were involved. The literature suggests that a large number of students (around 500) is needed for IRT (see \cite{brown2017evaluating}). Given the difficulty of obtaining participants, I expect that the difficulty information was calculated from a much smaller cohort. A discussion is needed of why IRT estimates based on a small cohort are expected to be accurate, and of whether a simpler difficulty metric, such as the percentage of correct responses, was considered (both metrics are written out after this list for reference).
\item Section 6, training data paragraph: It is necessary to give more details about how expert prediction data were collected.
\begin{itemize}
\item How were the experts selected?
\item Since the authors used questions from four different domains (ontologies) and stated that each question was evaluated by 5 experts, do they have 5 experts for each domain?
\item What were they asked to predict (d, nd)? Was this required for each type of learner (e.g., q1 is difficult for beginners, not difficult for intermediate learners, not difficult for experts)?
\item Were they required to answer the questions as well?
\item What was their agreement on prediction?
\item How long did they spend on each question? (This is particularly helpful for future studies.)
\end{itemize}
\item Section 6.1, paragraph 1: The authors reported an accuracy of 76.73\%, 78.6\% and 84.23\% for their three regression models. However, accuracy on its own does not show the full picture. What about the performance of the models on each class (performance on d, and performance on nd)? This is especially important if the class distribution is skewed. What about the models' performance in predicting expert-assigned difficulty versus student-performance difficulty? Other metrics that could be reported are precision, recall, and F-measure (a minimal sketch of such a per-class report is given after this list).
\item Section 6.2, paragraph 1: The authors investigated the percentage of non-classifiable cases by analyzing questions generated from five ontologies. The relation between this set of questions (from five ontologies) and the set of questions used in training (from four ontologies) is not clear. Are the questions investigated in this section different from those used for training and evaluation? If more data are available, why were they not used for building or evaluating the models?
\item Section 7, paragraph 2: The authors stated that ``Twenty four representative questions, selected from 128213 generated questions, were utilized for the comparison.'' What is meant by ``representative'' needs to be defined. For self-containment, the selection process needs to be outlined.
\item Section 7.1, paragraph 1: The authors claimed an 8.5\% improvement in prediction using the new set of features. However, due to the small sample size (24 questions), I have concerns about the generalisability of the results. This needs to be discussed and mentioned in the limitations.
\end{enumerate}
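For points 1 and 3 above, and purely as a point of reference (the notation is mine, not the authors'), the standard Rasch (one-parameter logistic) model gives the probability that student $i$ with ability $\theta_i$ answers question $j$ of difficulty $b_j$ correctly as
\[
P(x_{ij}=1 \mid \theta_i, b_j) = \frac{e^{\theta_i - b_j}}{1 + e^{\theta_i - b_j}},
\]
whereas the simpler alternative mentioned in point 3 is the proportion correct, $p_j = \frac{1}{N_j}\sum_{i=1}^{N_j} x_{ij}$, computed over the $N_j$ students who attempted question $j$.

For point 5, the following is a minimal, illustrative sketch of the kind of per-class report I would expect, using scikit-learn with hypothetical label vectors standing in for the authors' data (d = difficult, nd = not difficult):
\begin{verbatim}
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical gold and predicted labels, for illustration only.
y_true = ["d", "nd", "d", "nd", "nd", "d", "nd", "d"]
y_pred = ["d", "nd", "nd", "nd", "nd", "d", "d", "d"]

# Per-class precision, recall and F1, plus overall accuracy.
print(classification_report(y_true, y_pred, labels=["d", "nd"]))

# Rows correspond to actual classes, columns to predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["d", "nd"]))
\end{verbatim}
Reporting these per-class figures separately for the student-performance observations and the expert-prediction observations would address both concerns at once.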
\section{Some minor remarks}
\begin{enumerate}
\item Abstract: The authors stated that previous approaches suffer from being simple. Simplicity is not in itself the problem; the focus should be on the performance of previous approaches.
\item Introduction, paragraph 3: The authors stated that ``questions that are generated from raw text are suitable only for language learning tasks''. Several text-based approaches generate questions that are not intended for language testing; for example, see \cite{karamanis2006generating} and the work of Michael Heilman.
\item Section 2.1, paragraph 4: The authors stated that they studied ``all the possible generic question patterns that are useful in generating common factual questions''. I believe that the number of possible question patterns is infinite, and therefore the previous statement needs to be quantified.
\item Section 4, paragraph 1: It is stated that the similarity theory has been applied to analogy-type MCQs. The theory has also been applied to other types of questions; for more details, see \cite{alsubait2014generating}.
\item Figure 1: What is the input for each classifier? How are features extracted?
\item Section 6, training data paragraph: According to the authors, the questions used in the training set were generated from four ontologies. While a reader may look up information about these ontologies on the project website, it would be better to give a brief description of the ontologies and the generated questions locally (e.g., their size, whether they are hand-crafted, how many questions were selected per ontology) to make the paper more self-contained. If these ontologies were hand-crafted, this needs to be mentioned as a limitation.
\item Section 6, feature selection paragraph: What makes the selected feature selection methods ``popular''? Was this based on the literature?
\item Section 6, feature selection paragraph: I assume that these feature selection methods take the correlation between the features into account; do they? Are any of the features strongly correlated? (A quick check of this kind is sketched after this list.)
\item Section 6.1, paragraph 2: The authors reported that their method correctly classified about 77\% of questions. Of the remaining 23\%, how many were misclassified and how many were non-classifiable?
\item Section 6.2, paragraph 2: Are there any questions whose actual difficulty for different learner levels was unexpected? For example:
\begin{itemize}
\item questions that were easy for beginners but difficult for experts,
\item questions that were easy for intermediate learners but difficult for beginners and experts.
\end{itemize}
Any observations about the quality of such questions would be worth reporting.
\item Section 7, paragraph 2: Using the term ``actual difficulty'' is ambiguous unless it is defined earlier (e.g., stating that Rasch difficulty and actual difficulty are used interchangeably).
\item Section 7, paragraph 2: The authors reported a correlation of 67\% between the predicted difficulty and the actual difficulty. Providing the number of questions predicted correctly, as in the following paragraph, would make the comparison easier.
\item Conclusion, last paragraph: One of the limitations mentioned is that the method has been applied to medium-sized ontologies, and investigating its performance on large ontologies is stated as a future research area. Investigating the method on small ontologies is also needed. I suspect that deriving the metrics from a small ontology could give worse predictions; for example, in small ontologies the inferred class hierarchy is expected to be shallower, which will affect the accuracy of the `specificity' metric. A discussion of this and of other ontology characteristics that affect metric performance would be a valuable addition.
\item There are several places where numbers need to be presented in order to support the claims made:
\begin{itemize}
\item Abstract: ``... is found to be \textbf{satisfactory}'': what is the performance (in numbers)?
\item Abstract: ``8.5\% in correctly predicting the difficulty-levels of \textbf{benchmark} questions'': what is the size of the benchmark?
\item Introduction, paragraph 4: ``In the E-ATG system, \textbf{a state-of-the-art} QG system ...'': what makes it state of the art? How does it perform compared to others (in numbers)?
\item Introduction, paragraph 4: ``we have proposed an \textbf{interesting} method for ...'': what makes it interesting? How does it differ from existing approaches? How does it perform (in numbers)?
\item Introduction, paragraph 4: ``Even though this method can correctly predict the difficulty-levels to \textbf{a large extent}'': how does it perform (in numbers)? Any observations about cases where the method fails?
\end{itemize}
\end{enumerate}
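Regarding the feature-correlation remark above, the following is a minimal sketch of the kind of check I have in mind, assuming the proposed features can be tabulated per question; the column names and values are placeholders, not the authors' data:
\begin{verbatim}
import pandas as pd

# Hypothetical values for the four proposed ontology-based features.
features = pd.DataFrame({
    "feature_1": [0.2, 0.4, 0.8, 0.6, 0.3],
    "feature_2": [0.1, 0.5, 0.7, 0.6, 0.2],
    "feature_3": [0.9, 0.3, 0.2, 0.4, 0.8],
    "feature_4": [0.5, 0.5, 0.6, 0.7, 0.4],
})

# Pairwise Pearson correlations; large absolute values indicate
# redundant features that the selection step should account for.
print(features.corr())
\end{verbatim}
Reporting such a correlation matrix, or at least stating whether any pair of features is highly correlated, would make the feature-selection discussion easier to assess.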
\bibliographystyle{abbrv}
\bibliography{ref}
\end{document}
Comments
Typo
Dear Reviewers,
I have identified a typo in the manuscript. I kindly request that you take the following minor change into consideration while reviewing the paper.
In the Abstract, instead of "8.5% improved" it should have been "20.5%" (from 67% to 87.5%) -- the same mistake occurs in the conclusion section as well.
Thanking you.