Review Comment:
The authors propose MOntCS, a new ontology for structuring commonsense knowledge bases (KBs). MOntCS aims to address several shortcomings of popular KBs like ConceptNet, including the inconsistent granularity of events and the redundancy of relations. The design guidelines include restricting nodes to verb and noun phrases (to control specificity) and introducing structural relations to connect compound nodes to their constituents. In addition, the proposed ontology contains four classes of relations, including verbal (semantic), taxonomic, and affective relations, to cover a broad range of commonsense relations. The authors instantiate MOntCS on the WorldTree facts corpus ([1]) using semi-automated annotation. Experiments on WorldTree QA show that the KB obtained using MOntCS marginally improves over ConceptNet.
Commonsense reasoning is an increasingly popular area of research, and resources like new KBs and ontologies can be valuable for making progress. The proposed work highlights some important shortcomings of existing KBs, such as inconsistent granularity and redundant relations, and MOntCS takes promising steps towards addressing them. However, grounding some of the claims in experiments and analysis, and clarifying parts of the paper, would make it stronger.
## Areas of improvement
1. Lack of focus: the exact problems that the paper addresses can be better motivated. It would have been better if the stated shortcomings (multiple relations and inconsistent granularities) were grounded in empirical analysis, but the current version offers little to no evidence for them. There are works, for example [2-3], which are strengthened by the presence of multiple relations between entities in ConceptNet. Thus, it might be useful to explain these shortcomings better. Further, L19 states: "The design of ontologies for more general, open-ended common sense reasoning has been so far under-explored, and this is where our current paper's focus lies." However, MOntCS is inherently dependent on the available knowledge and, as such, does not address the problem of building more general resources for commonsense reasoning.
2. Experiments
I think the current experiments do not offer sufficient evidence for the utility of the proposed ontology. In detail:
2.1) Table 9, Column 1 shows that MOntCS has only marginal gains over ConceptNet, despite having a 1:1 alignment with the task, as the authors note. Further, the fact that the task overlaps with the graph makes WorldTree QA an uninteresting (and perhaps unfair) candidate for evaluation. To establish that the proposed scheme is general, I suggest that the authors try MOntCS on at least one more task. For example, WIQA~[4] might align well with MOntCS.
2.2) Table 9 should have another row involving no graph for a fair comparison. I suspect that neither of the graphs is helping (and some might be adding noise). Adding a no-graph baseline will clarify this.
2.3) Adding significance tests and repeating the experiments for different seeds might also help establish the difference between the graphs in columns 1 and 2.
2.4) Table 10 essentially shows that taxonomic relations are not helpful. Do they still need to be included? Further, column 1 of Table 10 indicates that the performance is essentially the same without any graph, which relates to the point made earlier about a no-graph baseline.
3. MOntCS as a tool for explanation
As the authors mention in Section 6, "Models are increasingly evaluated not just on performance, but on their ability to provide explanations for the choices they make. We design MOntCS to be a suitable medium for expressing explanations in the common sense question answering domain." However, the experiments section does not show any evidence that MOntCS can provide valuable explanations. Adding experiments on this front will significantly strengthen the paper.
## Grammar/typos, style, and presentation:
1. Page 2, L32: "Relation this scenario…"
2. Page 3, L43: "in the graph to take a.."
3. It might be better to add a citation for statements like "ConceptNet [10] is perhaps the most frequently used knowledge graph for common sense reasoning applications." This will allow dropping speculative phrases like "perhaps," which might not work well in this setting. Alternatively, you can rephrase them to something like "ConceptNet [10] is one of the most frequently used knowledge graphs for common sense reasoning applications." A similar statement is present on page 5, L11: "ConceptNet is most commonly used as the base knowledge graph, a subset of which is chosen for computational reasons."
4. "As a graph grows denser, it becomes easier to select relevant data that may otherwise require many hops to reach from the starting nodes." Similar to the above, this statement sounds general but will depend on the specifics of the underlying graph. The relevant information may or may not appear closer as the graph grows denser. Thus, it might be helpful to qualify this statement and explain why this is expected. Another such statement appears on Page 15, L37: "a path length of 2 as used in prior work is insufficient."
5. Page 8, L3: "additioanl"
6. Page 12, L34: "missing structural links where they were missing"
7. Page 15, L32: "However, because QA-GNN also includes this embedding within the GNN (figure 1, label '2'), the *langauge model can still be trained."
8. Section 3.4 is very nicely written and is one of the most interesting parts of the paper: it clearly lays out the problems and presents possible solutions and design choices with motivating examples. I believe that some other parts of the paper (e.g., Section 5) can be improved with Section 3.4 as a reference.
9. Given the central role that WorldTree plays in this work, it is worth adding a sample table, either in the appendix or in the main paper.
10. Section 5 can use a rewrite for clarity. Several statements are either made without appropriate citation ("a path length of 2 as used in prior work is insufficient") or are unclear, like "Ensuring fairness in this scenario is difficult."
## Questions:
Q1: Why is Causes (Table 5) not placed in Affective relations?
Q2: Will "semantic relations" be a better term for verbal relations?
Q3: Is redundancy necessarily a bad thing? The paper lists redundancy as one of the main shortcomings of ConceptNet. However, it is unclear why the responsibility of disambiguating the proper relation should not lie with the downstream application. Further, redundancy can sometimes be advantageous by capturing the multifaceted nature of the relation between two nodes.
Q4: The second shortcoming, the level of specificity of the nodes, is related to the underlying data source and is not so much a problem with the nodes themselves. Since the authors use WorldTree, isn't whether or not the derived KB is at the right level of granularity, to a great extent, a function of the granularity at which the facts in WorldTree are expressed?
Q5: Page 15, L27: "The design of QA-GNN does not ensure that the graph…difficult to know the extent to which they drive performance in this model." Doesn't this directly undermine the motivation for having a knowledge graph at all? Please also see my note on providing evidence for using MOntCS as an explanation.
[1] Jansen, Peter, Elizabeth Wainwright, Steven Marmorstein, and Clayton Morrison. “WorldTree: A Corpus of Explanation Graphs for Elementary Science Questions Supporting Multi-Hop Inference.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA), 2018. https://aclanthology.org/L18-1433.
[2] Xu, Yichong, Chenguang Zhu, Ruochen Xu, Yang Liu, Michael Zeng, and Xuedong Huang. “Fusing Context Into Knowledge Graph for Commonsense Question Answering.” In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 1201–7. Online: Association for Computational Linguistics, 2021. https://doi.org/10.18653/v1/2021.findings-acl.102.
[3] Wang, Han, Yang Liu, Chenguang Zhu, Linjun Shou, Ming Gong, Yichong Xu, and Michael Zeng. “Retrieval Enhanced Model for Commonsense Generation.” ArXiv:2105.11174 [Cs], May 24, 2021. http://arxiv.org/abs/2105.11174.
[4] Tandon, Niket, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. “WIQA: A Dataset for ‘What If...’ Reasoning over Procedural Text.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 6076–85. Hong Kong, China: Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/D19-1629.