EMBench++: Benchmark Data for Thorough Evaluation of Matching-Related Methods

Tracking #: 1822-3035

Authors: 
Ekaterini Ioannou
Yannis Velegrakis

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
Abstract: 
Matching-related methods, i.e., entity resolution, search, and evolution, are essential components of a variety of applications. This research area contains a plethora of methods focusing on efficiently and effectively detecting whether two different pieces of information describe the same real-world object or, in the case of entity search and evolution, retrieving the entities of a given collection that best match the user’s description. A primary limitation of the area is the lack of a widely accepted benchmark for performing extensive experimental evaluation of the proposed methods, covering not only the accuracy of results but also scalability and performance under different data characteristics. This paper introduces EMBench++, a principled system for generating benchmark data for the extensive evaluation of matching-related methods. Our tool is a continuation of a previous system, with the primary contributions including: modifiers that consider not only individual entity types but all available types according to the overall schema; techniques supporting the evolution of entities; and mechanisms for controlling the generation not of single data sets but of collections of data sets. We also present collections of entity sets generated by EMBench++ as an illustration of the benefits provided by our tool.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Michelle Cheatham submitted on 05/Mar/2018
Suggestion:
Accept
Review Comment:

The authors' modifications to the related work section and significant expansions to the evaluation presented in Section 5 have resolved my primary concerns with this paper. The only issue that I still have is that another proofreading pass should be done, since there are still numerous typos and grammatical issues.

Review #2
By Matteo Palmonari submitted on 13/Mar/2018
Suggestion:
Minor Revision
Review Comment:

The authors have addressed all my questions and I believe that the paper is now almost ready for publication. I am particularly satisfied with Section 5 and how it is organized now.

I give minor revision because the paper requires a last check for typos and grammatical errors (there are quite a few in the current version). I also have some additional minor comments that the authors may consider in order to improve the paper (about definitions and about providing a couple of missing intuitive explanations). However, I believe the authors will be able to make the few suggested changes without requiring an additional review.

See detailed comments below.

*ABSTRACT:

“Matching-related methods, i.e., entity resolution, search and evolution” → “entity evolution” does not seem the name of a method

*Introduction:

I suggest defining more precisely the task of “entity evolution”

“produced with in” → remove “with”?

The list of contributions is not completely balanced (We introduce, we are generating, Provide – without “we”): I suggest making it more balanced (we + present form)

*Section 2

“open challenged” → “open challenges”

*Section 3

I still miss an intuitive overview of how EMBench++ works. Although most readers are familiar with entity-matching tasks, I suggest introducing the key idea behind EMBench++: create lists from legacy sources, recombine them, and then modify/corrupt them in such a way that a gold standard can be generated.
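To illustrate, a minimal sketch of that generate-recombine-corrupt idea could look as follows; all names and the modifier are invented for this sketch and are not EMBench++'s actual API:

```python
# Sketch of the generate-recombine-corrupt idea described above.
# All names and the modifier are invented; this is not EMBench++'s API.
import random

random.seed(42)

# "Legacy" value lists (in a real generator these come from real sources).
FIRST_NAMES = ["Anna", "Ben", "Carla", "David"]
CITIES = ["Athens", "Berlin", "Chania", "Dresden"]

def generate_entities(n):
    """Recombine legacy values into clean entities with stable ids."""
    return [{"id": i,
             "name": random.choice(FIRST_NAMES),
             "city": random.choice(CITIES)} for i in range(n)]

def misspell(value):
    """Toy modifier: swap two adjacent characters."""
    if len(value) < 2:
        return value
    i = random.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]

def corrupt(entities, rate=0.5):
    """Apply the modifier to a copy; ids are kept, so matches stay known."""
    dirty = []
    for e in entities:
        c = dict(e)
        if random.random() < rate:
            c["name"] = misspell(c["name"])
        dirty.append(c)
    return dirty

clean = generate_entities(10)
dirty = corrupt(clean)
# The gold standard falls out for free: equal ids are true matches.
gold = [(a["id"], b["id"]) for a in clean for b in dirty if a["id"] == b["id"]]
```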

Our current implementation, supports identify function rule meaning that the resulted derived table → Our current implementation, supports identity (?) function rule, meaning that the derived table

rules that combining → rules that combine

for set of Entity Modifiers → for a set of Entity Modifiers

refer to as propagation type → referred to as propagation type

*Section 4

In Definition 1, I found it a bit confusing that I \in N. Later on, I is defined as a set of tuples, so this piece of the definition seems to mix naming and mathematical definitions. I suggest double-checking it.

where i is the identification name → do you mean that i is the identifier of the modifier?

Initially, the entity set I_k−1 → the index here comes out of the blue and is a bit confusing (since you have used k to number the attributes, one may think that a reference to the attribute number is given). Later on, it becomes clear that this index refers to the entity set to which k−1 modifiers have been applied. I suggest providing the intuitive interpretation of the index here.
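For concreteness, the intuitive interpretation can be stated in one line (this is my reading of the paper's notation, not text quoted from it):

```latex
% One-line reading of the index: I_0 is the original entity set and each
% modifier application m_k maps the previous set to the next one, so
% I_{k-1} is the entity set after the first k-1 modifiers have been applied.
I_k = m_k(I_{k-1}), \qquad I_0 = \text{original entity set}, \quad k \geq 1
```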

I found it a bit unintuitive to refer to an entity matching scenario as a tuple, as this is only an entity pair to match, while the term “matching scenario” rather points to a matching task or to a set of entity pairs to match.

from level to level → I think that something is missing in the sentence.

an analysis of DBPedia [33,34], revealed → an analysis of DBpedia [33,34] revealed

Elimination: that removes → Elimination: this removes (same for “Expansion”).

*Section 5

and only few include entities of other types → here “other” is not clear yet; I suggest rephrasing the sentence.

collections do contain → collections do not contain (?)

is that the absence → is the absence

which 36.000 are clear → which 36.000 are clean (?)

The VLDB 2018 publication → I suggest rewriting this in a sentence similar to “EMBench++ has been used in a work …”

*Section 6

included usage of the available schema → included the usage of the available schema

Review #3
By Matthias Wauer submitted on 17/May/2018
Suggestion:
Minor Revision
Review Comment:

I've reviewed the submission following the approach described in [1]:

Read the introduction.
Identify the BIG QUESTION.
→ How to generate comprehensive benchmark data for entity matching.

Summarize the background in five sentences or less.
→ Existing benchmarks use real and synthetic data. Using real data has obvious shortcomings, such as insufficient ground truth. For synthetic data, the existing generators are less powerful in terms of the control a user has over modifications, according to the authors. They state that their approach also supports generating volatile data. Further, the authors discuss existing matching methods and indicate that current benchmarking systems only target similarity methods, but fail to address collective matching, entity evolution, and large-scale data processing as in blocking methods.
I’m missing a discussion of/comparison to a few relevant recent contributions in this field, such as SPIMBENCH (https://www.ics.forth.gr/_publications/GraphTA15.pdf). Also, despite its focus on spatial data, the linking benchmarks of the HOBBIT project (as in OAEI 2017.5) are related, but since they are very specific I don’t object to omitting them from this discussion.

Identify the SPECIFIC QUESTION(S).
→ How can the existing EMBench system be extended to provide means for generating benchmarks addressing entity matching on data sets with relations between entities, over collections of data sets with certain characteristics, and for volatile data?

Identify the approach.
→ The authors extend the system, defining the respective operations on modifications, attributes and values.

Read the methods section.
→ Apparently the generation can be configured in XML. Although the discussed modification options appear to be reasonable, there is no clear indication of why the authors decided on this set of modifications (coming from related work? requirements gathered somehow? etc.), or which additional modifications may be relevant for certain use cases. The authors mention a study on DBpedia regarding its volatility, which might include an analysis of common types of changes and common patterns like entity merge and split. However, this is not clearly discussed.
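For illustration, an XML-driven setup of this kind typically looks like the following sketch; the element and attribute names are invented here and are not EMBench++'s actual configuration schema:

```python
# Hypothetical illustration of an XML-driven generator configuration.
# The element and attribute names are invented for this sketch; they are
# not EMBench++'s actual configuration schema.
import xml.etree.ElementTree as ET

CONFIG = """
<benchmark entities="1000">
  <modifier name="misspelling" rate="0.2"/>
  <modifier name="token-swap" rate="0.1"/>
</benchmark>
"""

root = ET.fromstring(CONFIG)
n_entities = int(root.get("entities"))
modifiers = [(m.get("name"), float(m.get("rate")))
             for m in root.findall("modifier")]
print(n_entities, modifiers)  # 1000 [('misspelling', 0.2), ('token-swap', 0.1)]
```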

Read the results section.
→ The empirical evaluation lays out the advances over static benchmark collections very well. It continues by discussing the properties of collections generated by EMBench++, also considering evolution and (to some degree) comparing them to real-world datasets. Finally, it briefly discusses an application of the framework in a recent experimental evaluation.

Do the results answer the SPECIFIC QUESTION(S)?
→ It does answer these questions to some degree. While the authors have addressed the review comments of Reviewer 2 in their evaluation, my impression is that the added information covers reasonable, but relatively straightforward, things (e.g., a Zipfian distribution of names when this distribution is selected in the configuration (Fig. 9; a small sampling sketch follows this list); evolved entities when a similar percentage has been configured (Fig. 10)). What is lacking are the following aspects of the evaluation:
1) Are the configurable modifications for the evolution (e.g., the modification functions) representative of the modifications that can be found in real datasets? In this regard, the paper neither lists all the available modifications (it mentions volatility and misspelling in examples), nor does it examine the modifications in a real dataset, e.g., using a random sampling approach, to see whether the kinds of changes to these entities can be represented by the configuration options of EMBench++.
2) Is the framework scalable w.r.t. generating large datasets? This might be compared to, e.g., SPIMBENCH, which appears to provide runtime measurements for dataset sizes of up to 500M triples.
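To make the Fig. 9 point concrete, here is a minimal, self-contained way of drawing names with Zipfian weights and checking the rank-frequency behaviour; it is purely illustrative and not EMBench++ code:

```python
# Minimal sketch: drawing entity names with Zipfian weights, as in the
# Fig. 9 discussion above. Purely illustrative; not EMBench++ code.
import random
from collections import Counter

random.seed(0)
names = [f"name_{r}" for r in range(1, 101)]   # names by rank 1..100
weights = [1.0 / r for r in range(1, 101)]     # Zipf with exponent s = 1

sample = random.choices(names, weights=weights, k=100_000)
counts = Counter(sample)

# Rank-frequency check: the count at rank r should be roughly count(1) / r.
for r in (1, 2, 5, 10):
    print(r, counts[f"name_{r}"])
```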

Read the conclusion/discussion/Interpretation section.
→ The conclusion section might be extended with an outlook of potential future extensions.

Now, go back to the beginning and read the abstract. Does it match what the authors said in the paper? Does it fit with your interpretation of the paper?
→ Yes, the abstract is a very good representation of the paper.

General comments:

The paper describes a system called EMBench++, an extension of an existing system intended for generating benchmarks for entity matching. The authors work in a field of active and very relevant research, and contributions to benchmarking methods for linking entities in different datasets in a reproducible way are of great interest. The extension provides novel contributions to important issues in real applications, such as entity matching on evolving datasets.

Although the authors have clearly extended the paper in this regard, there are still missing aspects in the evaluation that would be necessary to assess whether the tool can actually generate benchmark datasets as found in real datasets. In particular, while Section 4 provides a well-defined formalization of the approach, I think the authors could discuss in more detail which modifications are available (misspelling operators etc.) and how they identified those that are necessary (manual study of real datasets etc.). Likewise, a comparison of the detailed generated modifications with real dataset modifications would be an interesting evaluation aspect. Lastly, the authors don't discuss runtime behaviour, i.e., the time and memory required for generating the benchmark datasets.

Quality of writing is generally good. However, the authors should fix several missing/duplicate words, typos or confusing sentences, including but not limited to:
- "has been modified overtime time"
- "moves towards a thorough experimental evaluations"
- "..., which (primarily) include other than entity matching and search, also entity evolution." -> please rewrite, it's hard to understand
- "produced with in a controlled" -> within
- "Provide mechanisms that allow..." -> "We provide..."
- "the thorough experimental evaluating matching-related" -> "evaluation of..."
- "a number of open challanged."
- "On 2004" -> "In 2004"
- "The particular benchmark, first includes entities" (punctuation)
- "to rerate to each other"
- "Ik-1 into two sets" (missing word?)
- "Our current implementation, supports identify function... them or with strings" (very confusion punctuation)
- "EMBench contains implementations for set of" -> "for a set of"
- "different options, refer to as propagation type"
- "In particularly," -> "In particular,"

Further remarks:
- Number formatting should be fixed (e.g. in Figure 7), such as 38,000 and 5.6%
- Section 5.4 mentions a VLDB 2018 publication (remove the comma after the citation), but the reference says 2017.

[1] https://violentmetaphors.com/2013/08/25/how-to-read-and-understand-a-sci...