Graph4Code: A Machine Interpretable Knowledge Graph for Code

Tracking #: 2575-3789

Authors: 
Ibrahim Abdelaziz
Julian Dolby
James P McCusker
Kavitha Srinivas

Responsible editor: 
Ruben Verborgh

Submission type: 
Dataset Description
Abstract: 
Knowledge graphs have proven extremely useful in powering diverse applications in semantic search and natural language understanding. Graph4Code is a knowledge graph about program code that can similarly power diverse applications such as program search, code understanding, refactoring, bug detection, and code automation. The graph uses generic techniques to capture the semantics of Python code: the key nodes in the graph are classes, functions and methods in popular Python modules. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code), and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). We make extensive use of named graphs in RDF to make the knowledge graph extensible by the community. We describe a set of generic extraction techniques that we applied to over 1.3M Python files drawn from GitHub, over 2,300 Python modules, as well as 47M forum posts to generate a graph with over 2 billion triples. We also provide a number of initial use cases of the knowledge graph in code assistance, enforcing best practices, debugging and type inference. The graph and all its artifacts are available to the community for use.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Ben De Meester submitted on 19/Nov/2020
Suggestion:
Major Revision
Review Comment:

My review is in three parts: first, an assessment based solely on the reviewing guidelines and dimensions; second, personal remarks on the current text; third, typos and rephrasings.

To summarize: many things are lacking in this paper to pass the review criteria. That said, the approach seems reasonable, the use cases exciting, and many of my comments can be addressed by clarification and thus do not influence the core contribution of this paper: the GRAPH4CODE knowledge graph.

Even though I hope all my comments are addressed, I highly value the following ones:

- reproducibility: publish examples of your queries, results, and knowledge graph, so that users can play with toy examples and you can prove the usefulness of your dataset
- make sure your created vocabulary is available online, documented, and consistently discussed throughout the paper
- Clarify the long-term plan: do you have a maintenance and update plan?
- Clearly position your _dataset_ against the state of the art; the current related work discussion is too shallow.
- Have a proper conclusion

The highest risk I currently see is that there doesn't seem to be third-party use.
Combined with the lack of a long-term plan,
I'm afraid this work will get abandoned too soon to make an impact.

Feel free to get back to me if you have further questions.
Ben De Meester
ben (dot) demeester (at) ugent.be

#### Review criteria

##### Typical dataset information

The paper lacks quite a bit of typical dataset info:

- [V] Name: GRAPH4CODE
- [X] URL: after clicking through, I assume the canonical URL of the dataset should be https://archive.org/download/graph4codev1, but this is never mentioned in the paper
- [X] version data and number: not available, nor any changelog
- [X] licensing info: I could not find anything specific for the dataset, if it is the same as for the repo, it should be mentioned explicitly
- [X] availability: how long-term is archive.org hosting? What kind of guarantee will the authors give that this dataset remains maintained? Is there an update strategy?
- [V] topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.: Github, StackOverflow, WALA, etc.: this is clear.
- [/] metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressiveness, growth: there is some discussion on use of established vocabularies, but an argument for why these specific vocabularies were used is lacking
- [/] examples and critical discussion of typical knowledge modeling patterns used: there are some examples and some discussion of knowledge modeling patterns, but not that thorough
- [X] known shortcomings of the dataset: There are small hints at potential shortcomings, but nothing substantial is discussed

##### (1) Quality and stability of the dataset - evidence must be provided.

There is very little discussion on the quality and stability of the data: there is no maintenance/update plan, there is no discussion on the quality of the processing, there is no assurance that the knowledge graph is correctly built, except for the anecdotal evidence via the use cases.
The fact that the self-created data model is not resolvable and is not consistently discussed throughout the paper
sadly gives the feeling that the dataset as well might not be of high quality.
The lack of query results and the large cost to reproduce (the hardware requirements are high-end) give me further doubts.

##### (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided.

The usefulness of the dataset is clear and very well argued via the use cases. Some proof could be included more explicitly (finding the right results required some website browsing, which, imo, should not be needed; all important links should be given in the paper).

Third-party use is not discussed and, as far as I can see, is currently nonexistent.

##### (3) Clarity and completeness of the descriptions.

The used vocabularies are not resolvable, so their quality is unclear.
Also, no comprehensive snippet of the GRAPH4CODE dataset is available (only downloads of at least 300MB), so the quality of the actual dataset is also unclear.

#### General Remarks

##### 1. Introduction

I find the introduction's motivation a bit weak. The example of the 100 papers does not really give an argument as to _why_ Graph4Code is needed.

Also, the illustration is anecdotal: how sure are you that 'there is no easy way to tell that the model.fit...' is in fact problematic for developers? I would also appreciate including the actual link to the Stack Overflow page, to ease verification of your claim.

"Such an abstract representation would ease understanding semantic similarity between programs, making searches for relevant forum posts more precise" --> needs citation

"Yet most representations in the literature have been application-specific, and most have relied on surface forms such as tokens or ASTs" --> needs citation

"that enable applications such as in Figure 1." --> please clarify: what do you mean by 'such applications'? What kind of feature did you precisely envision? I can imagine you mean some kind of abstract representation, or examples that are automatically built/adapted based on the user's context, but it's hard to understand exactly what you mean. Adding a third panel to Fig. 1 that would demonstrate what you envision would greatly clarify things.

You mention the term "program" in this section, but it is not entirely clear what you mean by "program": how do you detect when something is an individual program, e.g., are test cases also programs, are snippets programs? Do you support partial programs? As far as I understand this is part of the WALA analysis tool, but I would advise a short clarification with a link to where we could find more information.

"To our knowledge, this is the first knowledge graph that captures the semantics of code, and it associated artifacts" --> There are certainly (semantic) alternatives out there; you even mention some in your Related Work section. I would suggest nuancing this claim.

There is no clear discussion of why exactly _these_ ontologies are reused.

I strongly suggest finding better evidence/related work to more substantially motivate the introduction.

Given this is a Dataset Description submission, I am not sure about how you list your contributions: I have the feeling (and this comes back when reading the paper further) that your main contribution is your dataset, and less the "model" and "approach".

Why are your techniques generic?

Why is WALA state of the art?

##### 2. Ontology and Modeling

The purl links are not resolvable (last try at 2020-11-18)

I don't understand the .ttl files you refer to: these are not valid ttl files. It seems they need to be processed, but they also do not seem to be correct.

Generally, I feel more clarification as to how the model got built is needed. Currently, it feels as if a lot of ad-hoc decisions were made, and no modeling methodology was followed.
This can be partially forgiven since this is a Dataset Description paper, so the largest contribution should be in the resulting dataset, but still some clarifications are too superficial to evaluate the rigor of the work (e.g., you partially argue why schema.org, but not why SIO). Although I do not necessarily question your choice of ontologies and modeling, I would like better argumentation and a more consistent structure as to why certain ontologies were chosen.

How flexible is your approach? I.e., how easy is it to change the used ontology? How is the actual knowledge graph generated?

##### 3. Mining Code Patterns

It is not clear to me what the contributions of WALA are. You mention "Figure 4 illustrates the output of program analysis. For this, we use WALA [...]", but I don't see the distinction between what exists and what you contribute:
does this mean WALA creates the code analysis, data flow, control flow, and creates the semantic representation? If so, I would clarify this and also remove the 'language neutral approach' as a contribution. If I am mistaken, I would suggest clarifying which parts come from WALA, and which parts are your contribution. If you create the semantic representation as a direct mapping based on the WALA output, please also clarify (but then I would also not say your contribution contains a language neutral approach). I do agree some explanation of WALA needs to be included to keep the publication self-contained, but you can just clearly state this imo.

Can you clarify/argue why http://purl.org/twc/graph4code/ontology/ contains the classes, and http://purl.org/twc/graph4code/ contains the predicates of the model?
Also, the use of both the g4c and graph4code prefixes, which at first glance should denote the same thing, is imo unnecessarily complicating things.

"we assume any call simply returns a new object, with no side effects" --> Can you include a discussion on the consequences of this assumption? How do you know that this assumption is justified and does not give too distorted a representation of reality?

"Since Python does not have methods per se" --> I find this description unclear, please clarify. Is it because Python is a dynamic programming language? Because of Object runtime alteration? Reflection? I would suggest to improve the phrasing and refer to the right kind of references.

##### 4. Linking Code to Documentation and Forum Posts

"We found 403 repositories using web searches on GitHub for the names of modules" --> I'd love if you could provide/publish the actual list somewhere

In the first listing you use `graph4code:Function`, whereas you mention `g4c:Function` in Section 2. This adds to my confusion above.

I do not fully understand why you use `skos:definition` here: this suggests that a Parameter and a Return are part of a SKOS taxonomy (even though the SKOS ontology does not necessarily define semantic restrictions). What is the reason for this, rather than using, e.g., dcterms:description, which is more generic?

In general, I really enjoy the fact that you generously link to your code, e.g. the bigquery and elastic search query. I would suggest to link to the permalink instead of the generic master reference, as otherwise this link will become a 404 if you ever rename relevant folders or files.
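
As a small illustration of the permalink suggestion, the sketch below rewrites a branch-based GitHub blob URL into a commit-pinned one. It assumes the usual `/<owner>/<repo>/blob/<ref>/<path>` URL layout and a branch name without slashes; the helper name `to_permalink` and the SHA are placeholders, not a real commit of the repository.

```python
from urllib.parse import urlsplit, urlunsplit


def to_permalink(blob_url, commit_sha):
    """Replace the branch segment of a GitHub blob URL with a commit SHA."""
    parts = urlsplit(blob_url)
    segments = parts.path.split("/")
    # Expected path shape: /<owner>/<repo>/blob/<ref>/<path...>
    if len(segments) < 5 or segments[3] != "blob":
        raise ValueError("not a GitHub blob URL: " + blob_url)
    segments[4] = commit_sha
    return urlunsplit(parts._replace(path="/".join(segments)))


url = "https://github.com/wala/graph4code/blob/master/docs/load_graph.md"
permalink = to_permalink(url, "0123abc")  # placeholder SHA, not a real commit
```

A commit-pinned link keeps pointing at the same file contents even if the folder is later renamed on the default branch.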

For the last listing, you mention "the following question is linked to SVC class"; it would be nice to explicitly mention _how_ :) (I assume schema:about?).

The website mentions WhyIs as an important component to generate the knowledge; I am a bit confused why this is not briefly discussed in this paper,
given you put your approach as a contribution.

##### 5. Knowledge Graph: Properties and Uses

Your introduction is off: it's as if the Related Work section used to come sooner.
In general, I don't think your introduction is strong: it has some unclear points (see my next paragraph) and I don't see which point it tries to get across.
Given this is a "Data Description" submission, I put high value on this section (i.e., it should be the main contribution), so this introduction is very important.

The way I read it, you say that WALA's popularity proves that multiple applications exist for an analysis-based representation of code:
on the one hand, I'd like to see some proof of this popularity (e.g., by mentioning some of these "multiple applications"),
on the other hand, I don't think I understand the sentence. It feels like it reads as a circular reasoning argument: its popularity proves its popularity.
So please prove WALA's popularity, and give a good argument why that's important.

"To our knowledge, GRAPH4CODE is the first attempt to build a knowledge graph over a large repository of programs and systematically link it to documentation and forum posts related to code." --> Well, yes, but I would suggest to nuance and soften your claim, as there certainly exists research on associating informal documentation to source code (e.g., https://doi.org/10.1109/ICSE.2015.212) and extracting structured information from a large repository of programs (both semi-structured https://doi.org/10.1016/j.scico.2012.04.008 and SemWeb structured such as CodeOntology, which you handle in your related work).
I'd like to see a more detailed comparison to the related work, to clearly state what differentiates GRAPH4CODE from some of the existing works.
You could, e.g., include and discuss a comparative table in your related work section and refer to that in Section 5.

"We believe that by doing so, we will enable a new class of applications that combine code semantics as expressed by program flow along with natural language descriptions of code." --> feels more like conclusion statements, which should go in the conclusion section imo.

I think you can greatly reduce the statistics discussion. On the one hand, you mention half of the statistics already in Section 1, on the other hand, your table already includes all the figures.
A small discussion of the statistics themselves would be much more relevant: how much is "278K functions, 257K classes and 5.8M methods"? Do you have any reference to prove that you have detected a significant part? How much of the code you extracted in Section 3.1 do you successfully process? Please put these numbers in context, it's better to know your approach and knowledge graph describes 5% of the input well, than not knowing at all.

For the queries, please include either links to show them (and the results) in action, or provide links to the results stored in, e.g., static files.
If I understand correctly from the documentation (https://github.com/wala/graph4code/blob/master/docs/load_graph.md),
the hardware requirements are not exactly household, so I would expect some proof of reproducibility without having to set up the system myself.
Also, the graph in your example (purl.org/twc/graph4code/docstrings) doesn't resolve (2020-11-19): either change such links to example.org links to clarify these are dummy links,
or make sure they resolve.

Am I right in assuming that docstrings is a single graph that contains all information from the inline documentation and introspection?
That confuses me: in Section 3 you state "Each program is analyzed into a separate RDF graph, with connections between different programs being brought about by rdfs:label to common calls (e.g.,sklearn.svm.SVC.fit)", so I assumed that each program was analyzed in a separate graph, but your SPARQL query seems to say otherwise.
Please clarify and argue this correctly both in Section 3 and here. It would help if your snippet in Section 4 clearly stated in which graph this data is found.
Also clarify in which graph the web forum links and static analysis metadata are stored.
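
The layout I am asking about can be pictured with a small, library-free sketch. It assumes one named graph per analyzed program plus a single shared docstrings graph, joined only through a common rdfs:label; whether this matches the actual dataset is exactly what needs confirming. All graph and node identifiers below are made up for illustration.

```python
RDFS_LABEL = "rdfs:label"

# Each quad is (graph, subject, predicate, object); identifiers are hypothetical.
quads = [
    ("ex:program1", "ex:p1/call3", RDFS_LABEL, "sklearn.svm.SVC.fit"),
    ("ex:program2", "ex:p2/call7", RDFS_LABEL, "sklearn.svm.SVC.fit"),
    ("ex:program2", "ex:p2/call8", RDFS_LABEL, "pandas.read_csv"),
    ("ex:docstrings", "ex:doc/SVC.fit", RDFS_LABEL, "sklearn.svm.SVC.fit"),
]


def graphs_mentioning(label):
    """Return the sorted names of all graphs containing a node with this label."""
    return sorted({g for g, _, p, o in quads if p == RDFS_LABEL and o == label})


fit_graphs = graphs_mentioning("sklearn.svm.SVC.fit")
```

Under this reading, a query restricted to one graph sees only that program or only the documentation, and any cross-program result has to go through the shared label, so the paper should state which graph each snippet targets.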

Similarly, in your second snippet of Section 5, please clarify `?g2`.

Having looked into https://wala.github.io/graph4code/, I strongly suggest linking again here, to clarify these query snippets are further explained online.

I would clarify "For other code assistance use case, please see [15].": having read [15], to me, a clearer description is "These and other use cases are demonstrated at [15]".
For the reader, I would also clarify the distinction between CODEBREAKER and GRAPH4CODE: I have the feeling GRAPH4CODE is mostly a new name, of which the use cases were demonstrated in [15], but I am not sure.

I find "5.4.1. Enforcing best practices" a very strong section.
To make your case stronger and your paper shorter, I would suggest adding the link to the online code snippet instead of the image, and shrink the image to contain only the most relevant parts. The link allows readers to look into more detail and gives more confidence.
For the end of the section, I strongly suggest adding the link to https://wala.github.io/graph4code/use_cases.html#case2

For 5.4.2, I would again suggest referring to the actual query and result set to prove reproducibility, and adding the link to the actual Stack Overflow question.
Also, I would include the link to the demonstration video.

##### 6. Related Work

Section 6.1 is well written and thorough; however, I find the concluding paragraph superficial.

"to create a more comprehensive representation of code" --> how do you know? Can you give arguments why?

"a multi-purpose knowledge graph" --> why is it multi-purpose and why are existing works single-purpose? can you prove this?

I find 6.2 far too brief, given this is a "Data Description" paper and thus the dataset is the most important contribution.
I gave some examples of relevant related work above.
Given the focus of this paper, I expect this to be much more thorough, clearly distinguishing GRAPH4CODE from the state of the art.

"programs that behave similarly can look arbitrarily different at a token or AST level due to syntactic structure or choices of variable names" --> this hints at the kind of argumentation I would expect when you say "to create a more comprehensive representation of code" in 6.1

When discussing [34], please clarify why the difference with your approach is important.

##### 7. Conclusion

The conclusion is very brief, not clearly aligned with the contributions, and does not have concluding remarks: what are the advantages and disadvantages of GRAPH4CODE, is there any future work?

#### Typos

##### 1. Introduction

- we build Graph4Code -> We built
- e.g. -> e.g.,
- and there many open -> and there are/exist many open
- "RDF provides good abstractions for modeling individual program data such as named graphs" --> I would revise, this sentence is a bit dense
- "SPARQL, the query language for RDF provides" --> "SPARQL -- the W3C recommended query language for RDF -- provides"

##### 3. Mining Code Patterns

- components is used -> components are used
- node show the read -> node shows the read
- Note that this type of modeling -> I assume you mean using named graphs, please clarify

##### 5. Knowledge Graph: Properties and Uses

- "the program analysis tool used to generate the analysis based representation" --> The second mention of "analysis" reads weird. Could you clarify or rephrase?
- "knowledge graph with a 2.09 billion " --> knowledge graph with 2.09 billion
- inside and IDE -> inside an IDE
- 5.4. Code Assistance: Next Coding Step -> shouldn't this be 5.3.1?

##### 6. Related Work

- a mostly intraprocedural --> a word is missing