# Explainable Zero-shot Learning via Attentive Graph Convolutional Network and Knowledge Graphs

### Tracking #: 2465-3679

Authors:
Yuxia Geng
Jiaoyan Chen
Zhiquan Ye
Wei Zhang
Huajun Chen

Responsible editor:
Dagmar Gromann

Submission type:
Full Paper
Abstract:
Zero-shot learning (ZSL), which aims to deal with new classes that never appeared in the training data (i.e., unseen classes), has attracted massive research interest recently. Transferring deep features learned from training classes (i.e., seen classes) is a common approach, but most current methods are black-box models without any explanations, especially textual explanations that are acceptable not only to machine learning specialists but also to people without artificial intelligence expertise. In this paper, we focus on explainable ZSL and present a knowledge graph (KG) based framework that can explain the feature transfer in ZSL in a human-understandable manner. The framework has two modules: an attentive ZSL learner and an explanation generator. The former utilizes an Attentive Graph Convolutional Network (AGCN) to match inter-class relationships with deep features (i.e., map class knowledge from WordNet into classifiers) and to learn unseen classifiers so as to predict samples of unseen classes, with impressive (important) seen classes detected; the latter generates human-understandable explanations of the feature transferability, using class knowledge enriched by external KGs, including a domain-specific Attribute Graph and DBpedia. We evaluate our method on two benchmarks for animal recognition. Augmented by class knowledge from KGs, our framework generates high-quality explanations of the feature transferability in ZSL and at the same time improves recognition accuracy.
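The classifier-prediction pipeline described in the abstract can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the four-class graph, the random weights, and all dimensions are toy assumptions, and the attention mechanism is omitted (plain row-normalised propagation is used instead).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy class graph: 3 seen classes (0-2) and 1 unseen class (3),
# connected by a symmetric adjacency with self-loops (WordNet-like hierarchy).
A = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]], dtype=float)
A_hat = A / A.sum(axis=1, keepdims=True)   # row-normalised propagation matrix

X = rng.normal(size=(4, 8))                # word embeddings of the 4 classes
W1 = rng.normal(size=(8, 8)) * 0.1         # layer weights (randomly initialised)
W2 = rng.normal(size=(8, 5)) * 0.1

# Two propagation steps: each class vector is mixed with its neighbours,
# producing one predicted classifier vector per class (dimension 5 here).
H = np.maximum(A_hat @ X @ W1, 0)          # ReLU
classifiers = A_hat @ H @ W2               # shape (4, 5)

# Training would regress rows 0-2 onto the CNN classifiers of the seen
# classes; at test time, row 3 scores an image feature by dot product.
image_feature = rng.normal(size=5)         # feature from a pre-trained CNN
score_unseen = classifiers[3] @ image_feature
```

The point of the sketch is only the data flow: graph propagation turns class knowledge into classifier vectors, and the unseen-class row is what scores test images.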
Tags:
Reviewed

Decision/Status:
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 01/May/2020
Suggestion: Accept
Review Comment:
The article is about explainable ZSL and presents a knowledge graph (KG) based framework that can explain the feature transfer in ZSL in a human-understandable manner. The framework has two modules: an attentive ZSL learner and an explanation generator. The former utilizes an Attentive Graph Convolutional Network (AGCN) to match inter-class relationships with deep features (i.e., map class knowledge from WordNet into classifiers) and to learn unseen classifiers so as to predict samples of unseen classes, with impressive (important) seen classes detected, while the latter generates human-understandable explanations of the feature transferability, using class knowledge enriched by external KGs, including a domain-specific Attribute Graph and DBpedia. I have carefully reviewed the changes that the authors have made in blue highlight. Most of my concerns have been addressed; hence, I recommend acceptance.
Review #2
By Dagmar Gromann submitted on 31/May/2020
Suggestion: Minor Revision
Review Comment:
Overall, the authors have addressed most of the comments from the previous iteration of the paper. However, from my perspective there is still a major issue preventing acceptance of the paper: the evaluation is only done on unseen classes. I had raised that issue in my previous review. The authors answered: “It is known that there are usually two testing settings in ZSL. One is standard ZSL, which predicts the testing samples of unseen classes with candidate labels from unseen classes only. The other is generalized ZSL, where the testing samples of seen and unseen classes are classified with candidate labels from both seen and unseen classes. In this paper, for investigating the feature transferability from seen classes to unseen classes, we focus on the standard ZSL setting to evaluate the prediction ability of unseen classifiers and generate explanations for these unseen classes. It is also worth considering how to deal with explainable ZSL in the generalized ZSL setting in real-world applications. Maybe we can adopt a two-phase framework -- a coarse-grained phase to judge if a testing sample comes from seen classes or unseen classes, and a fine-grained phase to make final predictions, where traditional classifiers (e.g., softmax classifiers) are used to predict its label with candidates from the seen class set if the sample is from seen classes as predicted by the coarse-grained phase, and ZSL classifiers are used to predict its label with candidates from the unseen class set if it belongs to unseen classes. We can make further attempts for this in the future.” My view is that this is not something to just look at in the future. It is perfectly justified, and even essential, to have an experiment where you want to show transferability. However, I also see a strict need to evaluate with the known classes in place. As far as I currently understand your work, there is also no need for a two-stage process.
Just treat the known classes in the same fashion as your unknown ones. This task will of course be harder, and that is exactly the point. I expect the results to be much worse than what you currently obtain. This, however, would still be an interesting outcome, because it would show that 1) you can transfer learn, but 2) when having both known and unknown classes, things do not work as well. Besides, it would be very exciting if you could provide deeper insight into how the class confusions occur most often. Either the confusion is more or less uniform (the less interesting case), or the confusion happens most often between seen and unseen classes, which would give us further insights. I do have some more minor issues below, but I see this experiment as a major missing piece in this paper. I was considering a major revision to make sure this issue was amended, but that would lead to an immediate reject. Hence, I decided to go for a minor revision and ask the authors to perform such an experiment for the next version of the paper.

A second issue that still needs more attention is describing how exactly the features flow between the models. I am still not getting the whole picture. It might have something to do with the phrasing. For example, I do not get the sentence “With learned feature vectors of classes, we use the CNN classifiers of classes as the training supervision to map inter-class relationship into deep CNN features so that predicting a visual classifier for each class node”. Is it correct that the features coming out of your AGCN are never actually put into the CNN, but only used at the end to compute a dot product, which is then interpreted as the score? The same confusion might be resolved if I understood what $f_i$ in formula 4 exactly is. Is it the outcome of a pre-trained CNN? If so, why do you call it “the classifier of seen class i”?
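If the reading questioned above is right, formula 4 amounts to regressing the AGCN outputs onto fixed CNN classifier weights, with test-time prediction by dot product and the CNN never receiving the AGCN output. A minimal numpy sketch of that reading (all names, shapes, and random values are hypothetical placeholders, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)

# f_i in formula 4, read as the last-layer weight vector of a pre-trained
# CNN for seen class i (fixed; the CNN itself is never re-trained).
f_seen = rng.normal(size=(3, 5))

# AGCN output: one predicted classifier per class (3 seen + 1 unseen).
agcn_out = rng.normal(size=(4, 5))

# Training loss: regress predicted seen classifiers onto the CNN's weights.
mse = np.mean((agcn_out[:3] - f_seen) ** 2)

# Test time: the AGCN output never enters the CNN; it only scores a CNN
# image feature by dot product, and the class with the top score wins.
feature = rng.normal(size=5)
scores = agcn_out @ feature
pred = int(np.argmax(scores))
```

Under this reading, the two models touch only twice: once through the regression target during training, and once through the dot product at test time.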
Minor issues:
- In equation 3, I am surprised to see that \hat{v}_i is computed using attention on the neighbors, but not using the state of the node v_i itself at all. Is that intentional? Why?
- Now that it is mentioned, it caught my attention that you have an extremely large state in the nodes (2048). What is the reason for that choice?
- You write “our model is a regression model rather than a classification model, which usually works better.” Which of the two works better? For which case?
- There are a couple of issues to which you gave more attention in your cover letter than in the paper. Perhaps you can also expand your explanations in the paper further.