Adversarial Transformer Language Models for Contextual Commonsense Inference

Tracking #: 2932-4146

Pedro Colon-Hernandez
Henry Lieberman
Yida Xin
Claire Yin
Cynthia Breazeal
Peter Chin

Responsible editor: 
Guest Editors Commonsense 2021

Submission type: 
Full Paper
Contextualized commonsense inference is the task of generating commonsense assertions from a given story and a sentence from that story. (Here, we think of a story as a sequence of causally-related events and descriptions of situations.) This task is hard, even for modern contextual language models. Some of the problems with the task are: lack of controllability over the topics of the inferred assertions; lack of commonsense knowledge during pre-training; and, possibly, hallucinated or false assertions. We utilize a transformer model as a base inference engine to infer commonsense assertions from a sentence within the context of a story. The task's goals are (1) to make sure that the generated assertions are plausible as commonsense, and (2) to assure that they are appropriate to the particular context of the story. With our inference engine, we control the inference by introducing a new technique we call "hinting". Hinting is a kind of language model prompting that utilizes both hard prompts (specific words) and soft prompts (virtual, learnable templates). It serves as a control signal advising the language model "what to talk about". Next, we establish a methodology for performing joint inference with multiple commonsense knowledge bases. While in logic, joint inference is just a matter of conjoining assertions, joint inference of commonsense requires more care, because commonsense is imprecise and the level of generality is more flexible; one needs to be sure that the results "still make sense" for the context. We show experimental results for joint inference over three knowledge graphs (ConceptNet, Atomic2020, and GLUCOSE). We align the assertions in the knowledge graphs with a story and a target sentence, and replace their symbolic assertions with textual versions. This combination allows us to train a single model to perform joint inference with multiple knowledge graphs.
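As a rough illustration of the hinting idea, the "hard prompt" part can be realized by appending a partial assertion to the model's input text. The helper below (`build_hinted_input`, a hypothetical name) uses an assumed template and separator; the paper's exact input format is not specified here:

```python
def build_hinted_input(story, target_sentence, hint=None):
    """Assemble a hinted input sequence for a text-to-text model.

    A hint is a partial assertion (e.g. a relation name and/or one of
    its arguments) appended to the input as a control signal telling
    the model "what to talk about".  The template and the " </s> "
    separator are illustrative assumptions, not the paper's format.
    """
    parts = ["story: " + story, "sentence: " + target_sentence]
    if hint is not None:
        parts.append("hint: " + hint)
    return " </s> ".join(parts)
```

For example, `build_hinted_input("Ann spilled her coffee.", "Ann spilled her coffee.", "xReact")` yields `"story: Ann spilled her coffee. </s> sentence: Ann spilled her coffee. </s> hint: xReact"`, and omitting the hint simply drops the last segment. Soft (learnable) prompt embeddings would be added inside the model, not in this string.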
Our final contribution is a GAN architecture that uses the contextualized commonsense inference model as a generator, and a discriminator that scores the generated assertions for plausibility. The result is an integrated system for contextualized commonsense inference in stories that can controllably generate plausible commonsense assertions and takes advantage of joint inference between multiple commonsense knowledge bases.
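The generator/discriminator interaction described above can be sketched as a single conceptual training step. The function and argument names below are illustrative stand-ins, not the paper's API, and real text GANs must additionally work around the non-differentiable sampling of discrete tokens:

```python
def adversarial_step(generate, score, story, sentence, gold_assertion):
    """One conceptual GAN-style step for assertion generation.

    `generate(story, sentence)` returns a candidate assertion string;
    `score(assertion)` returns a plausibility value in [0, 1].
    The discriminator loss pushes gold assertions toward 1 and
    generated ones toward 0; the generator loss rewards output the
    discriminator finds plausible.  Hypothetical names throughout.
    """
    fake = generate(story, sentence)
    d_loss = (1.0 - score(gold_assertion)) + score(fake)
    g_loss = 1.0 - score(fake)
    return fake, d_loss, g_loss
```

In an actual implementation both losses would be backpropagated through neural generator and discriminator networks; this sketch only shows the data flow between the two components.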
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 03/Nov/2021
Major Revision
Review Comment:

Overall, I find the paper clearly written. This paper studies an important problem, contextualized commonsense reasoning, and it proposes three key contributions: (1) a hinting method that better guides the commonsense inference; (2) an embedding-based method for grounding commonsense assertions to stories; (3) a GAN architecture for inferring assertions from stories.
However, there are some weaknesses in the proposed methods:
1. The hinting method is not fully explained and only a paper reference is given
2. The human approval rate of the assertion grounding method is somewhat low (~65%)
3. The GAN model’s performance is slightly worse than the baseline generator.
4. Some more questions as detailed below can be better explained.
Questions to the authors:
1. Section 2.2, line 17, why do you report scores in such a way? Typically, we use the dev set to pick the best model and report scores on the test set.
2. In table 1, is the no hint setup referring to the original setup proposed by the GLUCOSE paper?
3. In section 3.6, the automated metrics seem to suggest that hinting is good. With hinting, the joint model is able to achieve results comparable to the individual models, and overall results are much better than in the no-hinting setting. But for human evaluation, the joint model with no hinting gets the best results, which seems to contradict the automatic evaluation?
4. Overall, I think it would be better to compare with previous systems, e.g. BART/GPT2 trained on ATOMIC2020.
I think the paper can be improved by addressing the aforementioned weaknesses and questions.

Review #2
Anonymous submitted on 28/Nov/2021
Major Revision
Review Comment:

Overall Comments

This paper proposes a complex framework for solving the contextual commonsense inference problem. Specifically, it contains three major components: (1) hinting; (2) Joint inference from multiple knowledge graphs; (3) adversarial training. In general, I think this paper still needs significant revision to meet the standard of this journal. Please see my detailed comments below.


The novelty of this paper is limited. Even though this paper proposes three modules to solve the commonsense inference problem, most of them are borrowed from other tasks. The effects of these modules are also not very significant. For example, even though this paper spends several pages introducing the adversarial language model, it does not seem very helpful for this task.

Significance of the results

Some techniques, like hinting, have made significant contributions to performance improvement, while others have not. However, as this paper does not introduce many details about hinting, I cannot evaluate/verify whether it can help the community. Moreover, many experimental details are missing. For example, where does the hint come from, and what happens if you choose a different hinting strategy?

Quality of writing

The writing of this paper needs further polishing. Some arguments are too strong or not very clear. Please see my following comments:

Page 1, line 13: missing a “,”
Page 2, line 13: This argument is too strong (or the question this paper is trying to solve is too general). I do not understand why the proposed methods can solve the mentioned problems.
Page 2, line 17 (and many others): better to have a space before “[]”
Page 4, line 17: Personally, I like the idea of hinting, but this introduction omits too many details, and I cannot follow. For example, what does tuple refer to in your context?
Page 4, line 42: This argument is debatable. COMET also needs to be trained with a human-crafted commonsense KG. The success of COMET can also be that it utilizes the “semantic” information from language models to generalize the knowledge in those KG.
Page 5, line 1: Naming “merging several knowledge graphs” as joint inference might be misleading.
Page 5, line 8: This kind of conceptualization is not always correct. For example, one cannot walk a fish. But it fits all the patterns.
Page 6, line 46: Maybe I am missing something. What vectorization tool is used?
Page 9, line 43: This argument needs more support. I do not think a 9.4% overlap can cause the failure of the joint reasoning. Have you tried to train the model with the data to remove the overlap part to see if that is the true reason your joint inference model is not working?
Page 14, line 20: Something seems to be wrong with the reference.

whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data

I do not see a readme in the provided repo.

whether the provided resources appear to be complete for replication of experiments, and if not, why


whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability

Seems so

whether the provided data artifacts are complete

It was not analyzed.

Review #3
By Julien Romero submitted on 08/Dec/2021
Review Comment:


In this paper, the authors consider the task of contextual commonsense inference, i.e., generating commonsense assertions related to a text. More precisely, they investigate how one could leverage multiple knowledge bases to boost the system's performance. To do so, they construct an artificial dataset by aligning the statements from several knowledge bases with sentences in stories. Then, they train a system to predict the alignment.
They introduce a hinting mechanism that helps the model control the generation during training. Besides, they use a GAN architecture to perform both the generation and the evaluation of the output.

All in all, this paper has good potential. The topic is interesting, and the methods introduced present some novelty. However, it does not go in-depth enough, contains many imprecisions, and leaves too many questions unanswered. This is a bit frustrating for a journal paper.

I found the paper's organization hard to follow and the main story confusing. For example, sections 2, 3, and 4 contain at the same time previous works, definitions, preliminaries, and experiments. This makes it hard to separate the concepts and link the sections together. Therefore, I would recommend a more standard approach, with Introduction, Previous Work, Definitions/Preliminaries, Sections for the main concepts and experiments. Besides, I found many typos and some LaTeX errors that should not be present in a journal paper. Finally, some figures are either useless or almost impossible to read.

The concepts are not well-enough introduced. Although used in many places, the hinting mechanism is quickly presented and relies on a paper that is not yet published. More generally, the authors say that something is future work in too many places, whereas it should be included in this paper. The definition of the GAN does not come with any formalism and ignores existing previous works on the subjects.

The authors do not compare with any baseline mentioned in previous works for the experiments. In particular, the comparison with ParaComet, a system close to the one presented here, is left for future work. Also, there is absolutely no example of what the system produces, making it hard to understand its limits and what it actually does. Finally, concerning the results of the experiments, there is no evidence of their significance. The performances of the GAN introduced in the paper are worse than a simple generator. Although negative results can sometimes be interesting, here it does not bring much. The interpretations of the results of the experiments are also not very convincing, composed of hypotheses, and lack proofs.

My review may seem negative, but I believe the paper's content can become something good if each topic is explored in more depth and is better presented. I just do not think it is mature enough for a journal paper. I leave below a more detailed review that, I hope, will help the authors.

Detailed review:
* The DropBox link is a bit of a mess. There is no README, a large part of it seems to be copy-pasted from other projects (ParaComet, KBERT, which is not mentioned in the paper), and it is hard to find what the authors did. The authors should consider using GitHub for hosting their code.
* Page 2, line 17: It is unclear what "sentence-level" commonsense inference means for COMET. This system directly generates triples without the help of any context.
* Figure 2 is not useful. I am not sure it is ok from a copyright point of view to copy-paste a figure from another paper, and here, it brings no additional value.
* The explanation of GLUCOSE at the end of page 2/beginning of page 3 is hard to understand. Consider rewriting it.
* Page 3, line 19: I am not sure why the problem of inferring knowledge that was not seen before is connected to having only one dataset.
* Page 3, line 23: "consisting" -> "using". The dataset is not the fusion of the three knowledge bases.
* 2.1 leaves many interrogations open, and I find it strange to introduce a new concept in a journal paper without going in-depth. In particular:
* The difference between a prefix prompt and a hint is not very clear. For me, a hint is a dynamic prompt, and this area already has some previous work (a survey [1]). If it is close to existing works, consider not introducing a completely new concept/word.
* The authors claim that hinting is a hybrid between hard and soft prompting. For me, soft prompting means continuous prompts in the embedding space. Here, it seems that only "hard" words are given in the input.
* The training objective is not mentioned. Given the input and output, I would say that it is a prefix LM objective. Not all (generative) LMs are trained using this one. For example, GPT uses a standard language modeling objective to predict the next word given all the previous words. COMET is trained using this objective, and hinting does not seem to make sense in this case: the target triple will be both in the input and output depending on how we cut the sentence during training.
* A hint is first defined as "all but one of the elements" and later as "up to all but one of the elements." As the last occurrence suggests, can we have all but two of the elements?
* Is hinting also used during testing? It seems to be the case later.
* Is what GLUCOSE does (adding the required relation) a form of hinting?
* 2.1.1 defines contextual commonsense inference and introduces some previous work. I do not understand why it is in the same section as hinting (and in subsection 2.1, "What is hinting"). It should appear earlier in the paper: This is the main task of the article, and it is more fundamental than hinting. Here, it is squeezed between the definition of hinting and the related experiments.
* In 2.3, I like that the authors introduced additional metrics compared to GLUCOSE. However, the original paper reports BLEU scores that are very different. They are much higher. Why is it so? Besides, I do not understand why the authors do not directly compute the general BLEU score but instead report Avg. Max, Avg. and Median.
* In 2.3, what is the text generation method (beam, greedy)?
* What is the proportion of hints in the experiments? Does it impact the results?
* In 2.3, are the result significant (i.e., is the p-value less than 5%, for example)?
* Page 5, line 39: "We suspect ...". How can you verify this hypothesis?
* Page 5, lines 40-42: How do you make such a conclusion? I thought the experiments were done on a testing set where there should not be hinting.
* 2.3: Examples of outputs would have been nice.
* As the authors convert the predicates into text, open knowledge bases may be more adapted (e.g., TupleKB[2], Quasimodo[3], Ascent[4], GenericsKB[5])
* I would be extremely careful with the definition of general/specific. Commonsense knowledge bases are generally about concepts (thus the name ConceptNet), and concepts are general. "dog, HasA, tail" in ConceptNet is the same level of specificity as "PersonX drinks coffee, xFeel, energetic" in ATOMIC. The more specific version of these statements would be "Rex, HasA, tail" and "John drinks coffee, xFeel, energetic." Therefore, it is wrong to say that "ConceptNet does not have general specificity assertions." For previous work about dimensions of commonsense, please refer to [3,4,6,7].
* In 3.3, a table with examples would make things more explicit.
* I think the authors want to say here that they want to match GLUCOSE templates. However, ATOMIC only covers part of it (it contains only "Someone templates"). ConceptNet also includes "Someone templates," like here, although they are less complex than ATOMIC. So most of ConceptNet can be considered as simple "Something templates."
* For ATOMIC, it is not clear if Someone_A replaces PersonX to match GLUCOSE format.
* For ATOMIC, can we have examples of generated specific assertions?
* For ATOMIC, by making the statements specific before the alignment, do we not have a higher chance of creating an incorrect matching? For example, if we have "John drinks coffee, xFeel, energetic." but the sentence is "Bob drinks coffee to feel energetic," there is a problem. Here, examples of "Unacceptable alignments" would help understand what happens.
* The procedure to align sentences with assertions is interesting. It seems to me that the construction of this dataset is central in the paper and actually allows the joint inference. I wish the authors had considered and compared several alignment strategies. For example, why not align directly with sentences without considering stories? Besides, ParaComet provides a baseline. Why not compare with it?
* Again, we would like examples of alignments.
* In 3.5, why not consider each knowledge base alone?
* Page 8, line 34: why do we have hinting during the evaluation? Before, you said it was used during training. Does it not bias the results?
* For ConceptNet, why not use directly ConceptNet5 instead of the dataset provided by [29]? How "train600k" is constructed is not mentioned in the original paper.
* The authors should be careful when using ATOMIC-2020. This dataset is constructed partially using ConceptNet. That creates implicit dependencies.
* In 3.5, I am not convinced by the MTurk experiment. One hundred data points are very few, and on these data points, the authors do not consider the ones on which the raters (only 2) did not agree. How many were valid? What is the kappa coefficient? Is it really necessary to use Amazon MTurks to annotate only 100 points? Are the final results significant?
* Many figures do not seem to be referenced in the text, making it hard to connect them with the content (e.g., Figures 5, 6, 7, 8, 9, 10).
* Figures 5, 6, 7, 8, 9, 10 are impossible to read:
* The text is way too small.
* The text is not "human-readable" (it contains underscores).
* The variations of colors are not explained.
* The colors disappear when printing in black and white.
* It is hard to know which lines are longer than the others.
* Consider replacing these figures with a table (like in Section 2.3). Besides, results with and without hinting should appear next to each other to compare them.
* Why are the metrics used here different from the ones in Table 1?
* The explanations in Section 3.6 do not convince me. The authors make many hypotheses and implications that are not proven or illustrated. Once again, there is no example at all. For instance, I do not see the link between the overlap of knowledge bases and performances: The goal of joint inference is to leverage the diversity of the knowledge sources.
* The knowledge bases used have different sizes. Does it not bias which knowledge base the model will prefer to generate from? Some statistics here would be nice.
* In addition to alignment acceptance, it would be interesting to ask whether the statement is specific or not.
* Using a GAN for text generation is complex and was studied before. I find it very surprising that the authors just present a model without comparing it with anything done before. For example, the non-differentiability was partially addressed by the so-called Gumbel-softmax relaxation trick.
* Is the output of the K-NN meaningful? Can we have examples? What is it used for?
* 4.4. is not helpful as there is no experiment with it. It would be better to get more in-depth on other topics.
* Is 4.5 implemented in the model?
* In 4.6, which knowledge bases are used to construct the dataset?
* In Table 4, are the results significant?
* In Table 4, the rows of the last column should be the same as the alignment does not depend on the model. It shows the variability we have in the experiment and how hard it is to interpret the results.
* It would be nice to evaluate the output of the discriminator using MTurks to check if it makes sense.
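For reference, the Gumbel-softmax relaxation mentioned in the comments above can be sketched in a few lines. This is a generic illustration of the trick (Jang et al. 2017; Maddison et al. 2017), not code from the paper under review:

```python
import math
import random

def gumbel_softmax(logits, tau=1.0):
    """Temperature-scaled softmax over Gumbel-perturbed logits.

    Adding Gumbel(0, 1) noise to the logits and taking a softmax gives
    a differentiable approximation of categorical sampling; as tau -> 0
    the output approaches a one-hot sample, which is what lets a text
    GAN pass (approximately) discrete tokens from the generator to the
    discriminator while keeping gradients intact.
    """
    eps = 1e-12  # guards against log(0)
    noisy = [l - math.log(-math.log(random.random() + eps) + eps)
             for l in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp((n - m) / tau) for n in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

A framework implementation would apply this to the generator's vocabulary logits at every decoding step, typically with a temperature that is annealed toward zero during training.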

* At the top of each page, the main author's name is missing.
* Some citations are missing in the text (e.g., page 14, line 21 and page 16, line 1). I encourage the authors to read the errors given by LaTeX when generating the PDF.
* Sometimes, there is a space before a citation. Sometimes, there is not.
* Page 2, line 20: "i.e." not adapted here and can be removed.
* Sometimes, "use" might be more appropriate than "utilize".
* In some places, the authors forgot to capitalize "figure" or "section."
* Page 2, line 3: refer as -> refer to?
* Page 3, line 5: "etc." can be removed.
* Page 3, line 23: Too many commas
* Page 7, line 46: Remove the "of".
* More generally, in doubt, use an automatic tool to check the grammar of your paper and take time to re-read the final PDF.

[1] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586
[2] Mishra, B. D., Tandon, N., & Clark, P. (2017). Domain-targeted, high precision knowledge extraction. Transactions of the Association for Computational Linguistics, 5, 233-246.
[3] Romero, J., Razniewski, S., Pal, K., Z. Pan, J., Sakhadeo, A., & Weikum, G. (2019, November). Commonsense properties from query logs and question answering forums. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (pp. 1411-1420).
[4] Nguyen, T. P., Razniewski, S., & Weikum, G. (2021, April). Advanced semantics for commonsense knowledge extraction. In Proceedings of the Web Conference 2021 (pp. 2636-2647).
[5] Bhakthavatsalam, S., Anastasiades, C., & Clark, P. (2020). Genericskb: A knowledge base of generic statements. arXiv preprint arXiv:2005.00660.
[6] Chalier, Y., Razniewski, S., & Weikum, G. (2020). Joint reasoning for multi-faceted commonsense knowledge. arXiv preprint arXiv:2001.04170.
[7] Ilievski, F., Oltramari, A., Ma, K., Zhang, B., McGuinness, D. L., & Szekely, P. (2021). Dimensions of commonsense knowledge. arXiv preprint arXiv:2101.04640.