Review Comment:
This review was done jointly with Md. Rezaul Karim
===================================================
The paper presents an interesting idea, namely the use of a reasoner to improve the quality of a classification model, perhaps even an initial step towards merging symbolic and data-driven artificial intelligence. While we like this idea, the current work does not sufficiently evaluate the impact of the reasoning component, and hence we recommend a major revision of this paper.
The presentation style of the paper is of a good level, and the text is clear and readable. Moreover, the paper is relevant for this special issue.
There are several major issues we currently see with the paper, which are discussed below. Apart from these, there are some minor language mistakes here and there, which can be corrected by the authors while carefully revising the final version of the manuscript.
One of the main issues with the paper is the limited number of experiments performed and the parametrization of the model. The choice of a CAE seems rather conservative; newer models could at least have been tried as well, for example those relying on attention, or stacked convolutional autoencoders with reconstruction probability. Currently only one network is trained, and it is unclear whether this network is even the best option for this use case. One might also expect the use of specific image segmentation techniques (in particular semantic segmentation techniques), which were neither used nor elaborately discussed in the related work section.
A related issue is that the proposed technique does indeed improve the presented model, but it is unclear whether this is because the initial model was not the most suitable for the task in the first place. The question is hence whether the proposed technique is still able to improve other (better) models as well.
The use of the proposed OntoCity ontology seems a very reasonable one. One aspect which is, however, completely untested is what the influence of this choice is. It might, for example, be that using a different ontology (e.g., the GeoNames ontology, see http://www.geonames.org/ontology/documentation.html) would yield a different performance. Similarly, it is unclear what the precise effect of the spatial constraints introduced in section 3.4.1 is. They seem logical, but it is not at all shown that they improve anything.
The choice of only using data from one flat city area seems a bit strange to us. One would expect that an experiment is also performed on, for example, a rural area. This way, the robustness of the proposed approach could be demonstrated.
It is unclear to us how exactly the data flows back from the reasoner to the network. We understood that you add extra channels for the features, such as, for example, shadow. What is unclear is what kind of input you send on these channels. It is also unclear whether there is a way to include all information derived by the reasoner. Besides, it is not explained how these channels are populated with data during the test phase.
In section 3.3, the authors mention that they used certain hyperparameter settings, such as the number of layers, filter size, and optimizer, without providing any justification. How did they choose these hyperparameters? Did they perform a grid/random search with cross-validation, or were they chosen arbitrarily? Apart from this, did they use advanced techniques while training the network, such as batch normalization, or dropout as a means of regularization?
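To make the suggestion concrete, a random search over the hyperparameters in question could be sketched as follows. This is purely illustrative; `train_and_score` is a hypothetical stand-in for one cross-validated training run of the authors' network, and the value lists are examples, not the paper's actual settings.

```python
import random

# Example search space over the hyperparameters the paper fixes
# without justification (values here are illustrative only).
SEARCH_SPACE = {
    "n_layers": [3, 4, 5],
    "filter_size": [3, 5, 7],
    "optimizer": ["adam", "rmsprop", "sgd"],
}

def random_search(train_and_score, n_trials=20, seed=0):
    """Sample configurations and keep the best-scoring one.

    `train_and_score(cfg)` should return the mean k-fold
    cross-validation score for configuration `cfg`.
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        score = train_and_score(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Even reporting the result of such a modest search would make the chosen configuration far more convincing than fixing it without comment.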
We suggest you compare your approach with the following baseline: first train a model to predict elevation from the image. Then, augment the input of the network with this information (using an additional channel, as you did). If the proposed approach is able to beat this baseline, it is a much stronger indication that the semantic referee does indeed improve the classification performance.
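The mechanical part of this baseline is straightforward; the following sketch shows the channel augmentation we mean (in numpy, with `predict_elevation` as a hypothetical placeholder for the separately trained elevation model):

```python
import numpy as np

# Sketch of the suggested baseline: append a predicted elevation map
# as an extra input channel, analogous to the reasoner-derived
# channels in the paper. `predict_elevation` is a hypothetical model
# that maps an image batch (N, H, W, C) to elevation maps (N, H, W).
def add_elevation_channel(images, predict_elevation):
    elev = predict_elevation(images)                     # (N, H, W)
    return np.concatenate([images, elev[..., None]], axis=-1)
```

The only difference to the proposed approach would then be the source of the extra channel, which isolates the contribution of the semantic referee.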
The training times for the models are reported somewhat ambiguously. Round 1 receives 72 hours, while rounds 2-4 receive 24 hours each. What is unclear is whether rounds 2-4 continue from round 1's results or not. If they do, the comparison is rather unfair. What you should be doing instead is training a model without your additions that gets 72 + (n-1)×24 hours of training time; for example, your model at round 3 should be compared with a baseline model that has received 72 + 2×24 = 120 hours of training time.
You do mention that you use early stopping to end the training, but you do not specify how it was parameterized; in particular, how the validation set was created is not mentioned in section 3.2.
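For clarity, these are the two things that should be reported: the validation split and the stopping criterion. A minimal sketch, with `val_fraction` and `patience` as the parameters whose values the paper should state:

```python
import numpy as np

# Illustrative sketch only: a held-out validation split and a
# patience-based early-stopping rule, the two pieces of the setup
# that are currently unspecified in the paper.
def split_validation(data, val_fraction=0.1, seed=0):
    """Shuffle and split data into (train, validation)."""
    idx = np.random.default_rng(seed).permutation(len(data))
    n_val = int(len(data) * val_fraction)
    return data[idx[n_val:]], data[idx[:n_val]]

def early_stop(val_losses, patience=5):
    """Return the epoch at which training stops, or None if it never does."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None
```

Stating just these two parameters (and the seed or split procedure) would make the training protocol reproducible.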
Finally, in your conclusion, you state that the richer the ontology, the more meaningful the explanation from the reasoner. While this seems a valid statement, it is not really the question. Rather, the question should be whether "the richer the ontology, the more useful the explanation from the reasoner is for the training of the model."
You do report the confusion matrix in table 2. This is interesting to see, but you should report the same table also for the classification after the reasoner has done its job. When performing (many) more experiments, or when doing experiments with more classes, reporting only the RMSE would be sufficient in the main text; complete results should then still be reported either in the appendix or in a permanent archive.
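Computing the post-reasoner counterpart of table 2 requires nothing beyond the per-pixel labels before and after correction; as an illustration (assuming integer class labels):

```python
import numpy as np

# Illustrative sketch: confusion matrix from per-pixel class labels,
# which could be reported both before and after the reasoner's
# corrections to show its effect per class.
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts pixels of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm
```

Comparing the two matrices class by class would show exactly where the semantic referee helps.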
The source code for your work is not available for review. You should also share the trained model and perhaps some sample satellite images for testing, to make it possible for readers to build upon your work.
Besides these issues, there are still several minor concerns:
Section 1: the authors state, “Machine learning algorithms and semantic web technologies have both been widely used in geographic information systems”. Are there any references or related work to support this statement? Then, on the same page, they mention that the Semantic Web could be used for localization. What kind of localization do they mean? Any references?
In section 3.3, the authors mention that the parameters were initialized using the Xavier initializer. However, this can only be used for initializing network weights, not the other network parameters. How were those initialized?
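To clarify the point: Xavier (Glorot) uniform initialization is defined by the fan-in and fan-out of a weight tensor, so it simply does not apply to, e.g., bias vectors, which are typically initialized to zero. A sketch of the weight case:

```python
import numpy as np

# Xavier/Glorot uniform initialization for a weight matrix:
# values drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)).
# Biases have no fan-in/fan-out in this sense, hence our question.
def xavier_uniform(fan_in, fan_out, seed=0):
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```

A one-sentence statement of how the biases (and any other parameters) were initialized would resolve this.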
In section 3.6, the authors state, “there are a number of ways that the output from the reasoner (i.e., the error explanation) can influence a neural network-based classifier, e.g., training set selection, data selection, architecture design, and cost function modification”. Could you argue why you made this particular choice?
In the same section, the hardware configuration text (i.e., the experimental setup) could be moved to the beginning of section 4.3.
Style issue: please correct the inter-word spacing on page 6. The text becomes hard to read because the concept names and relations do not break properly.
It might well be that we are getting this wrong, but it appears that the numbers in table 1 should add up to 100, since you take the top 100 points. Where do these higher numbers come from? In fact, we cannot tell from the text what the exact meaning of the table is.
We do not really see the reason to introduce oc:intersects. We agree there might be cases where you do not really need to know which of the two it is, but keeping this extra information throughout the system does not seem harmful either.
In some places you mention that "the spatial reasoner is responsible for explaining the errors". We think this is an overstatement. To us, it seems that the reasoner currently only augments the knowledge about a pixel; the actual explanation is then done by humans.
How would the performance be when using all classes, rather than just these 5 rather easily separable ones? We would expect your approach to show its strength much more in that case.