EMBench++: Benchmark Data for Thorough Evaluation of Matching-Related Methods

Tracking #: 1589-2801

Ekaterini Ioannou
Yannis Velegrakis

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper

Abstract:
Matching-related methods, i.e., entity resolution, search and evolution, are essential parts in a variety of applications. The specific research area contains a plethora of methods focusing on efficiently and effectively detecting whether two different pieces of information describe the same real world object or, in the case of entity search and evolution, retrieving the entities of a given collection that best match the user's description. A primary limitation of the particular research area is the lack of a widely accepted benchmark for performing extensive experimental evaluation of the proposed methods, including not only the accuracy of results but also scalability as well as performance given different data characteristics. This paper introduces EMBench++, a principled system that can be used for generating benchmark data for the extensive evaluation of matching-related methods. Our tool is a continuation of a previous system, with the primary contributions including: modifiers that consider not only individual entity types but all available types according to the overall schema; techniques supporting the evolution of entities; and mechanisms for controlling the generation of not single data sets but collections of data sets. We also present collections of entity sets generated by EMBench++ as an illustration of the benefits provided by our tool.

Major Revision

Solicited Reviews:
Review #1
By Matteo Palmonari submitted on 14/Apr/2017
Major Revision
Review Comment:

The paper describes EMBench++, an extension of a tool to generate benchmarks for entity resolution. The tool (significantly) extends a previous tool (EMBench) with more features, in particular supporting evaluation in contexts where entity schemas are different and entities evolve over time.

The tool and its new features provide several functionalities and control mechanisms over the generated datasets. In addition, the feature to generate evolving entity sets is, to me, extremely useful to evaluate approaches that tackle the important topic of matching representations that change over time. Thus, the paper contains a contribution that is original compared to previous work.

The paper is reasonably understandable, although it contains several typos and some paragraphs require further explanation. In particular, the explanation of Figure 5 and the concepts used therein is not clear and needs to be improved.

I am quite convinced about the usefulness of the tool. Section 5 provides good arguments for the effectiveness of control mechanisms for the generation of datasets with desired features. As a small remark, Section 5 does not discuss a proper usability evaluation. Thus I suggest changing its title. What instead is missing in my opinion is an equally convincing argument for the usefulness and representativeness of the datasets that are generated.

I can better explain these concerns with a set of questions that the authors should address in the new version of the paper (in case of major revision).

• Have datasets generated with EMBench and EMBench++ been used to evaluate some entity resolution approach so far? If yes, to what extent?

• How does the data generated with the tool relate to other datasets used to evaluate entity resolution approaches? The relation to OAEI is quite clear, but other datasets have been used, e.g., DBpedia and LinkedGeoData [1], or the ACM-DBLP, Amazon-Google, and Abt-Buy datasets used in [2]. Can you generate datasets of the same size as the datasets used so far? Is there an upper bound on the size of the datasets that you can generate? More specifically, these questions become important in relation to datasets used to evaluate approaches for collective matching, entity evolution (a missing reference for this body of work is to temporal record linkage [3]), and blocking-based methods. In other words, arguments should be given (after the description of the tool, for example in a subsection of Section 5) to convince the reader that the generated datasets have the features that are required to test these approaches. To this end, you may compare datasets that you can create with EMBench++ to datasets used in these domains (e.g., making a table that lists the datasets used so far for evaluation and their features, e.g., synthetic vs. non-synthetic, size, etc.), or sum up the main challenges found for these approaches in the state-of-the-art and discuss how these challenges are well represented in datasets built with EMBench++.

• Will the tool be publicly available? (I consider this as an important factor for acceptance)

For the above-mentioned reasons, and despite the merit of the work described in the paper, I think that the paper needs to be revised before acceptance. I add more detailed comments below the references.

[1] Kevin Dreßler, Axel-Cyrille Ngonga Ngomo: On the efficient execution of bounded Jaro-Winkler distances. Semantic Web 8(2): 185-196 (2017)

[2] Axel-Cyrille Ngonga Ngomo, Mohamed Ahmed Sherif, Klaus Lyko: Unsupervised Link Discovery through Knowledge Base Repair. ESWC 2014: 380-394

[3] Pei Li, Xin Luna Dong, Andrea Maurino, Divesh Srivastava: Linking Temporal Records. PVLDB 4(11): 956-967 (2011)


P. 2

The benefit of using synthetic data … → Yes, but real-world data may reflect more heterogeneities that can be found in real-world matching problems. I suggest discussing pros and cons of synthetic datasets.

P. 4

It took me a while to understand whether derived column tables again contain one column each (resulting from rules applied to column tables) or contain more columns. I suggest stating this more explicitly.

"for for the attribute values" → for the attribute values

"These sources is" → These sources are

P. 5

Figure 2: Do you support URIs as identifiers? If yes, do you support URI generation schemes (e.g., www.example.org/person/Noela_Kulgen)? It may be helpful for benchmarking entity resolution for RDF.

“or the same set but different level of destruction. “ → please, revise “levels of destruction”.

“Advance Generation of Benchmark Data “ → Advanced Generation of Benchmark Data (?)

“either data among the different entities or the heterogeneity level. “ → please, revise.

“incorporating a particular type of “ → incorporating particular types of

“I_m=f_m( f_m−1(... f(I))). “ → shouldn’t be I_m=f_m( f_m−1(... f_1(I))) ?

I suggest revising the explanation of Definition 2. It is a very intuitive concept (and definition) if the intuition that each modified entity set contains a subset of unmodified entities along with a set of modified entities is explained before the formal definition (e.g., using the explanations in the paragraph “EMBench++ uses c to split […]”).

P. 6

“related to its ability of the approach to “ → related to its ability to

“ensuring foreign keys agree “ → ensuring that foreign keys agree

“using attributes that are other entity types “ → using attributes that have entities of other types as values (?)

Use listings for pieces of code, add a caption and reference them (similarly to Figures) in the text.

P. 7

Fig. 3: the notation used for C-A1 (“F_a,”) differs from that used for C-A2 (similarly for C-B1 and C-B2). Harmonize the notation in the figure.

“and keep incrementally include entities” → please, revise.

The adjust function in Definition 4 should be explained better. I suggest also using an example.

“we first apply the operators over F_a and generated F_b” → we first apply the operators over F_a and generate F_b (?)

“C-A is a independent” → C-A is an independent

P. 8

I like Figure 4, but all mechanisms should be exemplified (listings are helpful, but less so than examples of transformations).

“four investigated characteristic” → four investigated characteristics

P. 9

The plots and the tables in Fig. 5 should be explained in more detail, as well as the concepts needed to understand them. For example, the percentage of clean vs. duplicated entities is not clear (what is the total behind the 5.6%?). Why the entities always number 12,000 in Collection B is not clear. In Collection C, I guess that it should be 90000 instead of 9000. The semantics of the circles in the plots is also unclear, because the centers of the circles seem to represent points in the plane. I guess you want to represent the size of the datasets, but size seems to be represented by the x axis.

“sets have everything the same except the number of attribute entities.” → Please revise “sets have everything the same”, and I think it should be “number of entity attributes”.

“data sets those entities evolved in time.“ → please, revise.

Review #2
Anonymous submitted on 27/Apr/2017
Major Revision
Review Comment:

The paper introduces EMBench++, a framework to generate benchmark data. Compared to the previous system, modifiers are introduced which modify data across entity types, a technique is implemented to simulate the evolution of entities, and mechanisms are included to generate not only one single data set but whole collections of data sets.
Altogether, the paper is well written and nicely describes the framework as well as the enhancements of the system. It becomes clear why there is a need for benchmarks supporting, for example, the evolution of entities, because that is an emerging challenge for entity linking systems.

It would be good to know if the previous system has found widespread support in the entity linking community and whether systems are actually using the generated data sets. The main criticism is a missing evaluation to assess whether the data sets represent realistic scenarios and to show the behavior of matching systems on the data sets. The characteristics of one data set are investigated, but only regarding indicators like the number of entities or the number of duplicates. It would be important to see whether the data sets present realistic matching scenarios. For example, it is not clear if the implemented mechanism to simulate the evolution of the entities generates data sets that actually represent this scenario. An application of state-of-the-art entity linking systems could show the applicability of the generated data sets and, in turn, of the EMBench++ framework. Such experiments could be similar to the ones presented by the same authors in the paper "On generating benchmark data for entity matching", where the influence of the modification level on the effectiveness and execution time of one matching system is depicted.
To review whether, for example, the data generated for entity evolution presents a realistic scenario, characteristics of entities in a knowledge base that actually evolve could be gathered and compared to the characteristics of the created data set. By applying a matching system, the influence of the evolution on the performance etc. can be shown.

Review #3
By Michelle Cheatham submitted on 30/Aug/2017
Minor Revision
Review Comment:

This paper describes a system called EMBench++, which generates synthetic test sets for the evaluation of entity matching algorithms.

The paper is in general well-written and easy to follow, and the tool is likely of practical interest to researchers in this field.

I have two primary concerns with the paper. One is related to the related work section. This section describes both synthetic data generation systems and entity matching algorithms. In my view, the description of other synthetic data generation systems (with the exception of ISLab) is somewhat lacking, in that it does not clearly indicate the functionality of all systems mentioned and how that compares to EMBench++. For instance, the paper states that when compared to SWING “EMBench has more expressive power and offers more flexibility in the specification of the testing data.” More detail is needed here in order to fully establish the novelty of EMBench++. It might be helpful to include a table comparing the available features of all of the synthetic data generation systems mentioned.

My second concern is that there is no grounding of the synthetic datasets produced by the tool. In order to establish the utility of EMBench, it would be helpful to provide an argument (either logical or empirical) that the synthetic datasets it produces reflect the types of datasets found in the real world.

Minor issues:

The references given on page one for matching-related methods (10, 18) are very narrow. This is a very established field, so it would be helpful to either refer to more systems or to a survey paper.

When mentioning string similarity metrics in the context of entity matching, it might be helpful to reference Cheatham, Michelle, and Pascal Hitzler. "String similarity metrics for ontology alignment." International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013. (Full disclosure: this is my own paper, but it does seem particularly relevant here.)

In Section 2.2, it might be helpful to mention that techniques A through D are not mutually exclusive.

The paper seems unnecessarily verbose in some places. For example, the description of foreign key relationships in Section 4.1, particularly the description of Figure 2(a), is somewhat wordy. The descriptions of the various transformations are also rather in-depth, considering that they are fairly straightforward in most cases.


“with the most advance being the ability” -> “with the most advanced being the ability”

“can be used for correcting previous matching mistakes errors” -> only need either mistakes or errors, rather than both

“the systems also uses rules” -> “the system also uses rules”

“the mechanisms focuses on” -> “these mechanisms focus on”

“these sources is a (Derived)” -> “these sources are a (Derived)”

Definition 1 should probably mention that e sub 1 through e sub n are drawn from O.

“and keep incrementally include entities” -> “and keep incrementally including entities”?

“C-B2generated” -> “C-B2 generated”

“contains data sets those entities evolved in time” -> “contains data sets whose entities evolved over time”

“exactly the same except the number of duplicated entities” -> “exactly the same except for the number of duplicated entities”

“the number of entity attributes to 5” -> “the number of entity attributes is 5”

The numbers should have commas instead of periods as thousands separators.