Review Comment:
Summary
--------------
This paper presents SML-Bench, a benchmarking framework for evaluating inductive learning tools from ILP and the semantic web. The authors have conducted a thorough study of the related literature to identify candidate datasets and classification learning scenarios, and have applied the benchmark to evaluate the accuracy of 8 learning tools. The authors first present the challenges of creating such a benchmark, then provide an overview of the architecture with technical implementation details, and finally present the experimental results of the evaluation. Overall, as many ML efforts exist in the areas of the semantic web and description logics, the problem addressed in this paper is very challenging and the results are promising. Below are some more focused comments.
(1) Quality, importance, and impact of the described tool or system
--------------
This paper presents a tool for benchmarking structured ML tools. On the positive side, the authors have conducted a detailed study to find candidate datasets and scenarios for ML benchmarking tasks, they have provided a detailed evaluation study of 8 tools, and they have implemented a framework that can potentially be extended to benchmark other tools as well. On the negative side, the benchmark lacks clearly stated goals and elements regarding the quality factors that are measured and used for comparison; besides accuracy, there is no assessment of performance (e.g., runtime) for completing a task, nor of the scalability of each tool. Also, the learning problems refer only to classification tasks; it should be stated explicitly from the introduction that this benchmark does not address other types of ML tasks.
(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.
--------------
The paper is well written, although many concepts and assumptions should be explained in more detail, possibly through examples. In detail:
Intro, p1, par2. Explain more clearly what symbolic machine learning is, possibly by enriching the example (the chemical compounds) with a) example rules (e.g., in OWL syntax) and b) example algorithms (described in narrative form) used for the classification. Merely mentioning “…using some algorithms…” and “…Horn rules…” does not help the reader understand the problem being addressed. Also, mention and justify why only classification tasks are considered.
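To illustrate the kind of example that would help (the predicate and class names below are hypothetical and are not taken from the paper), a Horn rule for the chemical-compound scenario could look like

    active(X) ← hasAtom(X, A) ∧ element(A, carbon) ∧ charge(A, C) ∧ C ≥ 0.2

or, as an OWL/DL class expression, Active ⊑ ∃hasAtom.(Carbon ⊓ ∃charge.High). Spelling out one such rule would make the learning target concrete for readers unfamiliar with ILP.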
--------------
Intro, p1, par1. Please add references to the benchmarks mentioned in the introduction.
--------------
Intro, p2, par1. The second reason given for the effort required to model background knowledge is not adequately justified. Why is this the case? OWL is widely used for exactly this purpose.
--------------
RW. Although the paper states some limitations of the state-of-the-art benchmarks (e.g., “…data is in tabular format…”, “…the provided datasets are not sufficiently structured…”, etc.), there is no clear comparison between SML-Bench and the benchmarks referenced in this section, nor a statement of the contributions with respect to the existing efforts. I would expect the authors to identify a set of dimensions (e.g., dataset size, presence or absence of a schema, ML problems benchmarked, etc.), compare SML-Bench with the other benchmarks along these dimensions, and summarize the comparison in a table.
--------------
Section 3. I would expect this section (or a section following it) to describe the basic elements of the benchmark. Benchmarks usually specify the data, the tasks (e.g., queries, ML tasks, etc.) used for evaluation, the configuration parameters, and the quality factors, i.e., the goals to be measured and compared. Regarding the latter, what are the measures that you consider?
--------------
Section 4, “Paper is available” and “Availability of the datasets”. Both criteria are trivial to consider/mention in your methodology. Please also state why you do not produce artificial datasets / learning scenarios (e.g., for testing various parameters of the benchmark, such as scalability).
--------------
Section 4, Derivable Inductive Learning. Please explain whether the final review/selection of all publications and datasets was performed manually. How did you assess that a dataset represents an inductive learning scenario? Perhaps you could include a sentence explaining why only 11 out of the 805 datasets in Table 2 were selected.
--------------
Section 4, Table 3. Please add a column with the origin or a reference to the source paper for each dataset used in SML-Bench.
--------------
Section 5. Please explain the extensibility of your framework for assessing other tools: what are the required configuration steps?
--------------
Section 5, p8, par1. Fix the typo (“positives”).
--------------
Section 6. Please include an introduction with the goals and an overview of the evaluation, with respect to the datasets, the tasks, the measures, and the tools tested (including any configuration applied). Is the goal of this section to evaluate the tools or SML-Bench itself?
--------------
Section 6. Please explain why you do not consider performance or scalability assessment in your benchmark.