Knowledge Extraction from source code based on Hidden Markov Models

Tracking #: 1943-3156

Azanzi Jiomekong
Gaoussou Camara

Responsible editor: 
Guest Editors Knowledge Graphs 2018

Submission type: 
Full Paper
Large software systems evolve rapidly, and these evolutions are usually integrated directly into the source code without updating the conceptual model. As a consequence, implementation platforms evolve faster than business logic. Indeed, when extracting knowledge to enrich or build an ontology, business logic is not always a complete data source. To solve this problem, some authors have suggested adopting an ontology learning approach in order to extract knowledge from the source code. In this paper, we show how to realize this task using Hidden Markov Models. Experiments on EPICAM (a tuberculosis surveillance system developed in JAVA) show the relevance of this approach.


Solicited Reviews:
Review #1
Anonymous submitted on 23/Aug/2018
Major Revision
Review Comment:

In this paper, the authors present a method for extracting knowledge from source code using Hidden Markov Models. Knowledge extraction from source code aims to address the problem that changes made in the source code are not propagated to the conceptual model, which hinders software maintenance and management.

I agree that the problem which motivates this paper is critical in many fields, especially for large open-source software systems. Also, there is some novelty in this paper in that previous studies have tried to extract knowledge from metadata or documentation, but not from the source code.

However, as I will elaborate below, additional experimental results and arguments are necessary to support the effectiveness and advantages of extracting knowledge from source code.

First, I think additional experimental results are necessary to convince readers that statistically extracting knowledge from source code using HMM actually solves the problem, and also is more efficient and easier than other methods.
- Even though the authors have extracted knowledge from two systems and shown its precision, recall, and how relevant the extracted terms are, the number of systems is too small to generalize and the domains of the systems are also limited. Experiments on systems in other domains that are developed by diverse developer groups will help convince readers that HMM works well on this problem.
- It seems to me that the accuracies, especially recalls, are not good enough to argue that HMM is the best method for knowledge extraction. In order to show that HMM beats other methods in terms of performance, comparison with other methods (e.g., parser-based approach) should be added to this paper.
- Experiments to show other advantages of HMM would also help convince readers. Training an HMM model based on one system (e.g., EPICAM) and extracting knowledge using this model from other systems (e.g., Hadoop) can show HMM’s genericity, and some user studies with knowledge engineers will help prove its ease of use.

Also, I think concrete arguments on the effectiveness of knowledge extraction from source code should be added to the paper.
- According to Table 11, extracted terms are not necessarily relevant to the domain, and for rules, there are even more irrelevant terms than relevant terms. This does not intuitively convince me to accept that knowledge extraction from source code is effective for knowledge engineers. I think some arguments to justify this result should be included in this paper.
- In section 4.4.2, the authors describe that all extracted terms were in the gold standard and 16.18%, 7.43%, 13.94% of all candidates were new terms. Although these percentages seem low compared to the costs incurred by the knowledge engineers, such as filtering out irrelevant terms, the authors do not explain the impact of the new terms. Therefore, if the newly extracted terms turned out to be significant for conceptual models, some examples or explanations should be added.

Besides these major concerns, I have some minor questions and found some typos in the paper.

Below are some minor questions that I encountered.
- The authors use ontoEPICAM to record terms. If there is no already built ontology for a program, how can I verify the terms before domain experts verify them?
- The authors use Java Code Conventions to separate the terms. How can I do that when there is no convention specified for the program like in large open source projects?
- In section 4.3., the authors calculate the precision and the recall of their model, but I could not find how they obtain the ground-truth for the evaluation. If they manually inspect the terms and source code, what were the criteria? If not, how do they generate the ground-truth from the source code?
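The ground-truth question above matters because precision and recall are only defined relative to a fixed reference set. As a minimal sketch, assuming the evaluation is set-based (an extracted term counts as correct when it appears in the gold standard), the computation could look as follows. The function name and both term sets are hypothetical and invented for illustration; they are not taken from the paper.

```python
def precision_recall(extracted, gold):
    """Set-based precision and recall of extracted terms against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = extracted & gold
    # Precision: share of extracted terms that are in the gold standard.
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard terms that were extracted.
    recall = len(true_positives) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical term sets, invented for illustration only.
gold_terms = {"Patient", "Examen", "Region", "Centre"}
extracted_terms = {"Patient", "Examen", "Region", "serialVersionUID", "logger"}

precision, recall = precision_recall(extracted_terms, gold_terms)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, every choice made while building the gold standard (who labelled it, and by which criteria) directly changes both numbers, which is why the criteria need to be reported.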

Below are some typos or calculation errors that I found, and there might be several more in the paper.
- P16: we verify if by removing … -> we verify it by removing ...
- Table 11: 07,89% -> 07.89%
- Table 11, Properties row: 81.44% + 18.59% is greater than 100%. 38741(18.59%) is even greater than 38355 (81.44%).
- P18: A simple algorithm enable to remove redundancies terms automatically.

Review #2
By Anoop Kumar submitted on 18/Sep/2018
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The authors propose a method to learn ontologies by extracting knowledge from source code using hidden Markov models. They claim the relevance of the proposed approach on EPICAM. Based on the described evaluation, it is unclear how good the proposed approach is. It would help if the authors compared the proposed approach to existing baselines.

Analyzing source code and generating a representation, which in this case is an ontology, is not new. It has been done in academia and by industry for quite some time and there are many tools to do that. Did the authors review the output of Java grammar parsers on the code? This may be the first step, since Java is a structured language with a well-defined grammar. It would be nice to compare the proposed approach to grammar parsers. Next, since model-driven projects were considered, it would be useful to review what entity-relationship (ER) and UML reverse engineering tools produce and either demonstrate how the proposed approach differs or show how it can use the output from such tools in preprocessing steps. Employing HMMs in this way is probably original, but the authors need to make a strong case that it is necessary and discuss the limitations of non-probabilistic grammar-based approaches.

It is not clear that the results are significant. The approach is not compared to any baselines. How is the proposed 4-state HMM better at extracting the “TARGET” labels than just the grammar? The evaluation section describes a comparison to gold standards which were constructed by the authors, which is very likely to influence their HMM-based extractors. A good evaluation would be to compare the extracted knowledge to a “held out” ontology. The authors assume that code is written according to good programming practices, which is not always true. Further, the labeling process to construct training data is not described well. The tables report the basic observation and transition statistics in the code. It is not clear why EM was not used to train the HMM. The authors should report the performance of HMMs on the basic labeling task and then evaluate it for knowledge extraction.

The authors claim that the rules were validated without giving any details. The same holds for the GeoServer evaluation: how would a simple rule-based or grammar-based extractor compare? While the precision-recall numbers are high, it is difficult to assess their significance.

The paper is easy to follow. Authors did a good job introducing the problem and providing background information. The method section could have been better. Instead of listing the code, they can describe the algorithm using pseudocode or flowcharts. The screenshots are difficult to read and don’t make any significant point.

The paper needs significant work to be accepted. It needs to address the above concerns and develop a rigorous evaluation approach to demonstrate that HMMs really help. In particular, Section 4 needs to make a case that HMMs are better than a naive approach.

Specific comments:

Both JAVA and Java are used in the paper. It should always be Java when referring to the programming language.

Page 14: The explanation of the Viterbi algorithm through equations is not clear. It might be helpful to draw the table and show how the values in the cells are calculated. Once you do that, Listings 4 and 5 of source code are not necessary.
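To illustrate the kind of table meant here, the following is a minimal sketch of Viterbi decoding that keeps the full table, so each cell can be printed and traced. It assumes a 4-state model with the labels PRE, TARGET, POST and OTHER; only PRE and TARGET are named in this review, the other two labels and all probabilities are hypothetical values chosen for illustration, not the parameters reported in the paper.

```python
STATES = ["PRE", "TARGET", "POST", "OTHER"]

# Hypothetical parameters; missing entries are treated as probability 0.
START = {"PRE": 0.5, "OTHER": 0.5}
TRANS = {
    "PRE": {"PRE": 0.5, "TARGET": 0.5},
    "TARGET": {"POST": 0.8, "OTHER": 0.2},
    "POST": {"OTHER": 1.0},
    "OTHER": {"OTHER": 0.7, "PRE": 0.3},
}
EMIT = {
    "PRE": {"public": 0.5, "class": 0.5},
    "TARGET": {"Patient": 1.0},
    "POST": {"{": 1.0},
    "OTHER": {"public": 0.25, "class": 0.25, "Patient": 0.25, "{": 0.25},
}


def viterbi(tokens):
    """Return the most probable label path and the full Viterbi table."""
    # table[t][s]: probability of the best label path ending in state s at position t.
    table = [{s: START.get(s, 0.0) * EMIT[s].get(tokens[0], 0.0) for s in STATES}]
    back = [{}]
    for t in range(1, len(tokens)):
        row, pointers = {}, {}
        for s in STATES:
            # Best predecessor of s: maximise previous cell times transition probability.
            prev = max(STATES, key=lambda p: table[t - 1][p] * TRANS[p].get(s, 0.0))
            row[s] = table[t - 1][prev] * TRANS[prev].get(s, 0.0) * EMIT[s].get(tokens[t], 0.0)
            pointers[s] = prev
        table.append(row)
        back.append(pointers)
    # Backtrack from the best final cell.
    path = [max(STATES, key=lambda s: table[-1][s])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, table


tokens = ["public", "class", "Patient", "{"]
path, table = viterbi(tokens)
for token, row in zip(tokens, table):
    print(f"{token:8}", "  ".join(f"{s}={row[s]:.4f}" for s in STATES))
print(path)
```

With these illustrative numbers, the decoded path is PRE, PRE, TARGET, POST, and each printed row corresponds to one column of the table a reader would fill in by hand: the previous cell, times the transition probability, times the emission probability.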

“Statistical methods are more computable, general, and scalable [12]. Then, they are easy to adapt from one source code to another.”– Don’t agree on this claim. Method is only one aspect, data and the problem to be solved play an important role in establishing computability, generalizability, and scalability.

Though EPICAM is developed in JAVA, we will exploit the structure of the JAVA source code [39] to extract knowledge in order to build this ontology. – What is the point here?

There are several methods to determine the parameters of a HMM: statistical learning on data, Baum-Welch algorithm, Expectation-maximization algorithm, etc - repeated multiple times. Why isn't EM used?
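For context on this question: when every training token already carries a label, HMM parameters can be estimated directly by counting, whereas Baum-Welch (an instance of EM) is the usual choice when labels are missing. A minimal sketch of the counting approach follows, assuming hand-labelled (token, label) sequences; the function name, labels and training data are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_hmm(labelled_sequences):
    """Maximum-likelihood HMM parameters from fully labelled (token, label) sequences."""
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for seq in labelled_sequences:
        if not seq:
            continue  # skip empty sequences
        start[seq[0][1]] += 1
        for token, label in seq:
            emit[label][token] += 1
        # Count label bigrams for the transition matrix.
        for (_, a), (_, b) in zip(seq, seq[1:]):
            trans[a][b] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (
        normalise(start),
        {s: normalise(c) for s, c in trans.items()},
        {s: normalise(c) for s, c in emit.items()},
    )


# Hypothetical hand-labelled training data, invented for illustration.
data = [
    [("public", "PRE"), ("class", "PRE"), ("Patient", "TARGET"), ("{", "POST")],
    [("class", "PRE"), ("Region", "TARGET"), ("{", "POST")],
]
start, trans, emit = estimate_hmm(data)
print(start)
print(trans["PRE"])
print(emit["TARGET"])
```

The counting estimate is only as good as the labelling, so it does not remove the need to describe how the training labels were produced; it merely explains why EM may be unnecessary when labels exist.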

Typos and grammatical corrections
Page 2, Paragraph 2
an ontology that model the domain knowledge may be a good solution -> an ontology that models the domain knowledge may be a good solution
Page 2, Paragraph 3
Studer and al. -> Studer et al.
Page 2, Paragraph 4
In fact, the domain evolves, -> In fact, as the domain evolves,
Page 5, Paragraph 7
results produce by the algorithm -> results produced by the algorithm
Page 5, Paragraph 11
functional evaluation consist in looking -> functional evaluation consists of looking
Page 6, Paragraph 1
main experts who have to judge to what extend -> main experts who have to judge to what extent
Page 10, Last paragraph
We use OWL language -> We use OWL (“language” is redundant here)
Page 11, Last paragraph
must be done with updating the conceptual model -> must be done to update the conceptual model
stakeholders to communicate -> stakeholders to be communicated
Page 13, Paragraph 1
code that permit to -> code that permits us to
Page 14, Paragraph 1
The logic is not clear. Why do you extract knowledge from any Java application and how is it used in EPICAM?
It consists to break down the calculations into -> It consists of breaking down the calculations into
Page 16, Paragraph 1
All the terms was browsed -> All the terms were browsed
Page 18, Paragraph 2
are used temporary -> are used temporarily
Page 18, Paragraph 3
simple algorithm enable -> simple algorithm enables
Page 18, Paragraph 8
The evaluation permit to judge -> The evaluation enables us to judge
Because these knowledge can -> Because this knowledge can

Review #3
By Sebastian Lehrig submitted on 22/Oct/2018
Review Comment:

The paper describes an approach to extract knowledge (concepts and concept relationships) from Java source code. In this approach, Hidden Markov Models (HMMs) are trained on a training set of Java files such that they can output concepts and relationships of interest. Once trained, these HMMs can be used for extracting knowledge from a targeted Java project. The extracted knowledge can eventually be used for constructing an ontology specific to the targeted Java project. The authors have evaluated their approach with two example Java projects and by comparing extraction results to reference results gained from domain experts.

On the positive side, the implementation of the HMMs-based extraction is described in great detail and appears to allow the reproduction of extraction results. Moreover, several artifacts (implementation, Java source code to be extracted) are publicly available, which strengthens this impression.

However, there are multiple issues on the negative side that underlie my vote to reject the paper in its current form. I am going to describe these issues along the three dimensions of "originality", "significance of the results", and "quality of writing".

(1) originality
There is a rich corpus of related work on reverse engineering and reengineering software based on its source code, but the authors do not sufficiently relate to such literature, for example, "Reengineering component-based software systems with Archimetrix" by von Detten et al. A difference is that such related work often extracts knowledge into architectural models like UML models. However, the way this paper describes the output can be seen as subsets of the UML: for example, the discussed relationship types "association", "taxonomy", and "attributes" are all part of UML class diagrams. Without further clarification, I strongly doubt originality here.

(2) significance of the results
- The discussion of potential threats to validity is insufficient. For example, I would have expected discussions on the relevance of the programming paradigm (object-oriented like Java vs. functional vs. procedural). Also the selection of HMMs should be discussed, e.g., the impact of the Markov property on results.
- The usefulness of the final outcome is questionable and remains unmotivated. It is unclear to me in which cases domain experts profit from the ontological knowledge representation as opposed to a knowledge representation, for example, using the UML - a language specifically designed to communicate knowledge on software between involved stakeholders. The level of abstraction in the ontological representation seems to be close to that of the source code, which potentially complicates higher-level discussions between stakeholders. In this context, I am also missing a discussion on the validity of evaluation results, e.g., a higher number of identified concepts compared to the gold standard is evaluated positively, which ignores the potential positive impact of the abstraction level.

(3) quality of writing
- The fundamentals on ontology learning allow the reader to classify the proposed approach, but most described concepts are irrelevant in the remainder of the paper. For example, in Sec. 2.2 ("Data sources for ontology learning") and Sec. 2.3 ("Ontology learning techniques"), only a small subset of the described concepts ("source code" and "HMMs") is relevant. In the end, I was therefore wondering why I had to go through all of that content.
- Until section 4.6, I was wondering why the authors have chosen an approach based on regular expressions, given that bullet-proof regular expressions are hard to specify. In section 4.6, the authors finally discuss parser-based approaches as an alternative - but that came too late for me and did not convince me either, leaving me wondering why the authors did not use existing tools like MoDisco to directly work on the level of an abstract syntax tree. With a tool like MoDisco, the claim that it is necessary to "modify the source code of a parser" (Sec. 4.6) appears to be wrong.
- There are lots of details on the implementation of the knowledge extraction, going down to the level of concrete implementation code. This appears too detailed and should only belong to supplementary material. I would have preferred more discussions of related work and results and a more thorough motivation instead.
- The paper is in bad shape when it comes to style and spelling. For example, I counted 22 typos on the first four pages alone (see "Typos" below).

Minor Details:
- Abstract: "business logic" is part of code as well; please clarify
- Sec. 1.:
- unclear what is meant by "conceptual model"
- "an ontology that model the domain knowledge may be a good solution." - evidence?
- "Source code is rarely used." - evidence?
- Sec. 2.1:
- "Individual" should be defined after "Concept" because it is already used in subsequent definitions.
- "will be extracted from data sources": please write in active voice (who is extracting here? the authors vs. everybody who wants to identify ontologies?)
- Sec. 2.2.3: "In knowledge base, one can generate discovered rule as input to develop a domain ontology" -> I do not understand this sentence; also potential typos make it hard to understand it (is it "a knowledge base" and "a discovered rule"? and is the "discovered rule" _used_ "as input" opposed to _generated_?)
- Sec. 3.4.2:
- "Concepts, properties, axioms, and rules are usually arranged differently in the source code." - Why? What are your hypotheses and how do they relate to HMMs?
- "We assume that before entering the first word, the programmer reflects on the label of that word and as a function of it, defines the label of the next word and so on." - Is this a valid assumption? What happens if a programmer first writes the TARGET and then the PRE part?
- "ImogEntityImpl" (and others) - inconsistent with Fig. 2
- Sec. 4.2.2: "Recoding terms and rules" - is there a protocol for the work with the export? Would greatly help reproducibility...

Typos (considering only the first 4 pages):
- one may follows -> one may follow
- a group of domain expert -> a group of domain experts
- Building domain ontologies require -> Building domain ontologies requires
- of conceptual model -> of associated conceptual models
- in GATE software -> in the GATE2 software
- written in Java programming language -> written in the Java programming language
- age, of type Integer -> age of type Integer
- of concept Person -> of the concept Person
- Individual is instance of concept -> Individual is an instance of a concept
- a knowledge engineer conduct -> a knowledge engineer conducts
- domain evolves -> domains evolve
- the knowledge provide by domain experts -> the knowledge provided by domain experts
- discussion forum posting, specification, analysis and conception document -> discussion forum postings, specifications, analysis and conception documents
- extracting formal specification -> extracting formal specifications
- from database schema -> from a database schema
- that already reflect -> that already reflects
- is that, they -> is that they
- to use Ontology UML Profile -> to use the Ontology UML Profile
- with Ontology Definition Meta-model -> with the Ontology Definition Meta-model
- are closed to the terms -> are close to the terms
- can for example used -> can for example be used
- Pattern-based/Template-driven approach allows -> Pattern-based/Template-driven approaches allow