Predicting Concept Drift in Linked Data

Tracking #: 748-1958

Authors: 
Albert Meroño-Peñuela
Christophe Guéret

Responsible editor: 
Guest Editors EKAW 2014: Schlobach, Janowicz

Submission type: 
Conference Style
Abstract: 
The development and maintenance of Knowledge Organization Systems (KOS) such as classification schemas, taxonomies and ontologies is a knowledge-intensive task that requires a lot of manual effort. Librarians, historians and social scientists, among others, constantly need to deal with the change of meaning of concepts (i.e. concept drift) that comes with new data releases and novel data links. In this paper we introduce a method to automatically detect which parts of a KOS are likely to experience concept drift. We use supervised learning on features extracted from past versions to predict the concepts that will experience drift in a future version. We show that an existing domain-specific approach for predicting the extension of concepts can be successfully generalized to predict concept drift in KOS in a domain-independent manner. This result is confirmed by our experiments on a variety of datasets (encyclopedic, cross-domain, and socio-historical), enabling the creation of a generic tool to assist knowledge experts in their curation tasks.
Tags: 
Reviewed

Decision/Status: 
[EKAW] reject

Solicited Reviews:
Review #1
Anonymous submitted on 22/Aug/2014
Suggestion:
[EKAW] combined track accept
Review Comment:

Overall evaluation: 0 (borderline paper)
Reviewer's confidence: 3 (medium)
Interest to the Knowledge Engineering and Knowledge Management Community: 5 (excellent)
Novelty: 4 (good)
Technical quality: 4 (good)
Evaluation: 3 (fair)
Clarity and presentation: 2 (poor)

Review

This paper asks whether it is possible to predict "concept drift" in classification taxonomies such as those that can be represented in RDFS/OWL, SKOS, etc. Concept drift generally refers to a change in the meaning or usage of a concept over time (e.g., between versions). The authors motivate an analysis of concept drift as being important in the maintenance/versioning of a taxonomy. Three aspects of concept drift are considered: the labels of the concept change; the intensional definition of the concept changes, where properties used to define the concept change (e.g., skos:broader, rdfs:subClassOf, etc.); and the extension of the concept changes, where the number of instances changes. Drift occurs in dynamic knowledge-bases. The authors consider two such types of dynamic knowledge-bases: closed knowledge-bases, where changes have occurred in the past but the latest version is now static, and open knowledge-bases, where changes are on-going. For both types of knowledge-bases, the authors consider the refinement of concepts by looking at the "coherence" of intermediate versions with respect to previous versions. For open knowledge-bases, the authors consider the problem of predicting which concepts will drift. For the tasks of refinement and prediction, the authors propose a machine learning framework where older versions of a knowledge-base are used to train a classifier that is tested on the next version; a newer version than the test dataset is used to verify predictions. The authors use the training, test and evaluation versions as input for a WEKA machine-learning phase. Features such as the descendants and ancestors of the concepts at various levels, the number of instances, the number of instances including sub-concepts of a certain depth, etc., are considered. Threshold-based feature selection is then applied and models are trained. For evaluation, the authors consider two datasets: the first is the DBpedia ontology, with 8 versions spanning from 2009 to 2013 (an open dataset); the second is CEDAR, a historical Dutch taxonomy of occupations for censuses, with 8 versions spanning from 1849 to 1930 (a closed dataset). Refinement is evaluated for both datasets considering a varying number of versions. Prediction is evaluated for DBpedia considering all but the last version.
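
To make the setup concrete, below is a rough sketch of the kind of version-based train/test pipeline I understand the paper to describe. The per-concept feature table, the column and file names, and the use of scikit-learn in place of the WEKA phase are my own illustrative assumptions, not details taken from the paper.

# Illustrative sketch only (reviewer's reading): train a drift classifier on older
# KOS versions and evaluate it on the next one. Assumed input: one row per concept
# and version, with structural/extensional features and a binary drift label.
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import roc_auc_score

def split_by_version(df, train_versions, test_version):
    """Train on older versions, test on the next version (hypothetical helper)."""
    train = df[df["version"].isin(train_versions)]
    test = df[df["version"] == test_version]
    return train, test

feature_cols = ["n_descendants", "n_ancestors", "n_instances", "n_instances_subtree"]

df = pd.read_csv("concept_features.csv")  # hypothetical per-concept feature table
train, test = split_by_version(df, ["2009", "2010", "2011"], "2012")

# Threshold-based feature selection; the paper's exact criterion is not given here,
# so a simple variance threshold stands in for it.
selector = VarianceThreshold(threshold=0.1)
X_train = selector.fit_transform(train[feature_cols])
X_test = selector.transform(test[feature_cols])

# Naive Bayes is one of the methods mentioned in the discussion of Figures 3 and 4.
clf = GaussianNB()
clf.fit(X_train, train["drifts_in_next_version"])
scores = clf.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(test["drifts_in_next_version"], scores))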

The paper tackles an interesting area. Being able to predict which concepts are the most likely to change has consequences for versioning and maintenance of the knowledge-base. I like the general questions that the authors raise and the direction of the work. The paper is on-topic for EKAW and generally well-written. Likewise, evaluation seems appropriate for the questions raised.

My main concerns with the paper -- holding me back from making a more positive recommendation -- are (i) the coarse nature of the core notion of concept drift, and (ii) a lack of some crucial details.

With respect to the coarseness of concept drift, as defined, it seems that although the intensional, extensional and label similarity measures map to a real value in [0,1], when instantiated they pretty much become binary values. Namely, my understanding is that a concept drifts intensionally if some part of the definition changes, it drifts label-wise if a label changes, but, most problematically, it drifts extensionally if the number of instances changes. Focusing on the extensional drift, all of the instances of a class could change, but as long as the cardinality remains the same, the concept is not considered as drifting extensionally; one could validly argue that in practice this is probably not such a major concern, since if the instances change, it seems likely the cardinality will too. A larger concern is that even a change of one instance more or less in a class with a potentially very large legacy extension will be considered a drift, and thus treated the same as a class with an arbitrary-fold increase or decrease in instances. Likewise, with such a coarse notion of drift, I'm left wondering what concepts would not drift: if a change of one instance more or less is considered a drift, I could only imagine that some pretty obscure concepts would not be drifting in a dynamic knowledge-base.
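
To spell out this concern with a toy example (this is my reading of what the extensional drift check amounts to, not the authors' exact formula):

# Toy illustration of the coarseness concern: extensional drift detected purely
# from instance counts (reviewer's reading, not the paper's actual definition).
def drifts_extensionally(old_instances, new_instances):
    return len(old_instances) != len(new_instances)

# Every instance replaced but cardinality unchanged -> no extensional drift reported.
print(drifts_extensionally({"a", "b", "c"}, {"x", "y", "z"}))        # False

# One extra instance in a class with a huge legacy extension -> flagged as drift,
# indistinguishable from a class whose extension doubled.
print(drifts_extensionally(set(range(100000)), set(range(100001))))  # True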

This brings me to my second concern, which is that the paper lacks some crucial details. For example, I still have little idea what the goal of "refinement" is. Perhaps I've missed something, but all I've found in the paper is a statement to the effect: "The purpose of refinement is to check the coherence of an intermediate version with respect to previous versions." But I cannot find where coherence is defined (and I'm guessing, e.g., it's not the same coherence as defined in the ontology debugging world). Likewise, the goal of refinement is not intuitively obvious in the context of concept drift; I'm struggling to even hazard a good guess at it. Half of the evaluation presents results on "refinement": again, maybe I am missing something, but at the moment I can't say what it is. It's a similar problem with "prediction": prediction deals with concept drift, but is there a particular type of concept drift that is predicted? Or is the prediction merely that there is a drift of some kind? If so, I could imagine that lots of concepts would quite naturally drift by at least one in the number of instances. Hence I would have been interested to see some raw numbers on the actual number of concepts that drift (per the different types of drift) across the versions. Likewise, I think lumping the different types of drift together would be rather coarse, since different types of drift have different consequences and may be more/less predictable than others.

A third, but more minor, concern is what predicting label and intensional drift means in practice. The party most likely to be concerned about concept drift is the maintainer of the taxonomy ... which is the same party in control of changing the labels and the concept definitions. Hence it is as if they are being told in advance what they are going to do. I suppose someone from the community could use prediction to see which concepts are likely to change in the next version, but emailing the maintainers would seem more prudent. I'm a bit confused by this.

In general, it's an interesting topic and the authors have some interesting data to play with and evaluate. I'm quite torn on the verdict. On the one hand, the problems with the paper seem to be ones of clarification that could be easily resolved; on the other hand, the details I'm missing feel quite major, in that I do not have the whole picture to make a judgement: things like not knowing what "refinement" actually means, how many concepts actually "drifted" in the versions, and what notions of drifting were considered. As such, I'll leave my recommendation at borderline.

Again, I quite like the core idea of investigating this notion of concept drift in a Linked Data context. In terms of improving the paper, I think the authors should clarify the above ambiguities. Likewise, I think the authors should consider the different types of concept drift separately from each other in the classification. I'd also like to see some raw statistics on how many of the different types of concept drift occurred between versions. I would also favour a notion of extensional concept drift that is more fine-grained than "number of instances changed/didn't change". Also, since the authors mention considering other datasets in the future work, they might be interested in looking into the DyLDO snapshots: http://swse.deri.org/dyldo/. There are two years' worth of weekly Linked Data snapshots to play with.

MINOR COMMENTS:

Section 1:

* "As concepts drift their meaning" -> "As the meaning of concepts drift(s)"
* "In 6" -> "In Section 6"

Section 3:

* Definition 2: Are two concepts really identical if they have the same rigid intensional definition? For example, it would seem that many sibling classes would thus be identical, no? In any case, the notion of rigidity is not used later it seems. Why is it introduced?
* "cusotimizable" Run a spell-check!
* "dataset which versions" -> "dataset whose versions"
* "dataset which updated versions" -> "dataset whose updated versions"
* Footnote 5: it seems owl:equivalentClass would be more appropriate than owl:sameAs
* "splitted" -> "split"
* "in (OBO/OWL) ontologies" -> this is a very sweeping statement, especially since an OWL ontology could be considered roughly as "expressive"/generic as Linked Data.

Section 5:

* "relations is a count of relationships connecting these concepts/classes": this could mean many things. Please clarify.
* "10-12" -> "10--12"
* "and temporally closed (1795--1971)". This contradicts the earlier statement that the "last 8 versions" of CEDAR are used since the latest version listed in Table 1 is from 1930.
* "under the ROC curve ."
* I don't understand the absolute figures in Table 2. What does 21 mean?

Section 6:

* "Figures 3 and 4 show ..." The discussion refers to Figures 3 and 4 and talks about different ML methods like NaiveBayes and MultiLayerPerceptron, but none of the results data have such information or distinguish the ML method used. That discussion is confusing.
* I like that you present some concrete examples, but the discussion on CollegeCoach is more confusing than helpful. It doesn't really help me see why "it is easy to see why membership and structural features are highly ranked for CEDAR and DBpedia, respectively."
* "leave" -> "leaf"
* "respectivelly" -> "respectively" (Also the word is, in any case, redundant)

Review #2
Anonymous submitted on 27/Aug/2014
Suggestion:
[EKAW] reject
Review Comment:

Overall evaluation: -1 (weak reject)
Reviewer's confidence: 3 (medium)
Interest to the Knowledge Engineering and Knowledge Management Community: 3 (fair)
Novelty: 2 (poor)
Technical quality: 3 (fair)
Evaluation: 4 (good)
Clarity and presentation: 4 (good)

Review

In this paper, the authors propose a method for predicting concept drift in linked data. Their method generalizes an existing domain-specific method for predicting the extension of concepts. The problem of concept drift is important in ontology engineering. It is closely related to ontology evolution. The feature extraction method given in this paper is similar to change discovery. To predict the extension of concepts, the authors adopt a supervised learning method. Experimental results show the usefulness of the proposed method.

This paper is clearly written and easy to follow. However, it suffers from some major drawbacks:

First, in Definition 2, the authors make an assumption that concepts with identical rigid intensions will have the same URIs. The question is, why is this assumption reasonable? Since they are working on linked data, I do not really think this assumption is valid. I also think the similarity function between two concepts w.r.t. ext is not reasonable, as it only counts the cardinality of the extension of concepts. These two problems are fundamental. I think they should be resolved.

Second, the originality of the work is low. Indeed, this work is similar to the work presented in [15] in several aspects. In [15], the authors considered several supervised learning methods for predicting concept extension and conducted extensive experiments. In this work, the authors follow that idea and use supervised learning to predict concept drift. In [15], the authors considered structural features, such as the distance to a leaf term, is-a relations, direct and indirect terms, and terms at a given depth. These features are used in the current submission.

Due to the above two problems, I think this work lacks originality and has some technical problems. Thus, I do not recommend accepting it.

Review #3
Anonymous submitted on 31/Aug/2014
Suggestion:
[EKAW] conference only accept
Review Comment:

Overall evaluation: 1 (weak accept)
Reviewer's confidence: 4 (high)
Interest to the Knowledge Engineering and Knowledge Management Community: 5 (excellent)
Novelty: 3 (fair)
Technical quality: 3 (fair)
Evaluation: 4 (good)
Clarity and presentation: 4 (good)

Review

The paper entitled 'Predicting Concept Drift in Linked Data' describes a framework to study the change in ontological concepts used in the context of Linked Data. The topic as such is very interesting, timely, and clearly matches the EKAW 2014 call.

The paper is very well written and structured. The illustrations and tables are easy to read and provide important details, e.g., the workflow of the framework/pipeline or the top selected features for a given dataset. As will be pointed out below, the paper does not provide some important details. Therefore, in order to save some space, the authors could reduce the number of versions listed in Table 1. Table 3 needs to be better embedded in the text, especially in relation to Figures 3 and 4. While the motivation paragraph in Section 1 is too short, I wish all papers would clarify their contribution, targeted research question, and findings as clearly as this paper.

The related work section is sufficient for a conference-style paper and lists key work in the areas of ontology evolution and concept drift. The authors make heavy use of some reference items, such as [15] and [16], but overall the paper is sufficiently self-contained. There are, however, two key areas that are not covered: literature on semantic similarity measures (for description logic concepts/classes) and literature on the nature of concepts. There is a variety of work on semantic similarity in the Semantic Web literature as well as in the cognitive sciences. I will discuss their importance for the paper at hand below. On the concept side, and given the page limit, I would propose: 'Laurence, S., & Margolis, E. (1999). Concepts and cognitive science. Concepts: Core Readings, 3-81'. Again, details are given below.

The paper has three main shortcomings. First, the relatively small delta; second, the lack of important details, which makes the results not reproducible; and third, the weak definitions in Section 3 (which arguably form the core foundation of the paper).

(I) The core contribution of the presented paper is a two-fold extension of paper [15] ("Predicting the Extension of Biomedical Ontologies"). First, the intensional perspective on concepts is taken into account. Second, the scope is broadened by investigating applications in other domains (e.g., the digital humanities). This part is labeled 'domain-independent', which is overselling it a bit. The delta and innovation (with respect to items 15 [and 16]) seem rather small, especially given the fact that the intensional part of the paper is not well worked out (see III).

(II) While I enjoyed reading the paper, I also felt that the authors tried to put too much into it. Important details, e.g., about the link aggregator and normalizer, are missing. To give one example: what is meant by discarding outliers in this context? Is this based on standard deviations or another approach? Is outlier detection and removal really useful when the task is the prediction of future drifts? Similarly, the examples for the structure-driven and data-driven feature selection in Section 4 seem very counter-intuitive. While I understand that they are taken from previous work, more details are required to justify their use. For instance, the authors state 'e.g. if a class has a single sublcass, both should be merged'. I would imagine evolutionary biologists being very unhappy about this statement. Ironically, Homo sapiens is such an example, but to the best of my knowledge there are many more examples of monotypic taxa. I realize that there have been other Homo species before, but they are not necessarily listed in a taxonomy of current lifeforms. There are also other cases; I think the Ailurus genus has only one species (but I am not an expert). The same argument can be made for the data-driven aspect. The authors argue that 'if a class has many instances, the class should be splitted'. What about ant species? Each of them contains orders of magnitude more individuals than a particular species of elephant. Nonetheless, they will not be split, as their number just reflects their adaptive strategy (see r/K selection theory). I may simply misunderstand what the authors mean, but the given examples clearly invite controversies.

(III) My main concern, however, is with the definitions in Section 3. This is for two reasons: first, it is not really clear what 'concepts' are, and second, the definitions and similarity measures are counter-intuitive (or wrong?). With respect to concepts, I am not really sure whether the authors mean 'classes' here or something else. In the footnote they state that 'In general, users need to define their own identity functions between concept versions, e.g. using owl:sameAs or skos:exactMatch mappings'. Shouldn't this be owl:equivalentClass, etc., or am I misunderstanding something? More important, however, are Definitions 1-3 and sim_int, sim_ext and sim_label. What if one were to remove thousands of individuals and replace them by the same count of entirely different individuals? The cardinality would stay the same. More importantly, this strong dependency on counts may introduce surprising results. To give one example, geographers sometimes only consider 6 continents, while conventionally we often refer to 7 continents. This single difference, however, is very meaningful. Similar arguments can be made for Definition 2 and sim_int. What happens if I remove a global domain or range restriction in a new version of an ontology?

Summing up, this is a very readable paper on a timely and interesting topic. It is a great match for EKAW's diversity theme. The delta to previous work seems rather small, and there are many aspects that either lack sufficient detail or seem not well worked out. There is clearly more potential here. As argued above, perhaps the page limit was the problem here?