Review Comment:
The paper describes EMBench++, an extension of a tool to generate benchmarks for entity resolution. The tool significantly extends a previous tool (EMBench) with more features, in particular supporting evaluation in contexts where entity schemas differ and entities evolve over time.
The tool and its new features provide several functionalities and control mechanisms over the generated datasets. In addition, the feature to generate evolving entity sets is, in my opinion, extremely useful for evaluating approaches that tackle the important topic of matching representations that change over time. Thus, the paper contains a contribution that is original compared to previous work.
The paper is reasonably understandable, although it contains several typos and some paragraphs require further explanation. In particular, the explanation of Figure 5 and the concepts used therein is not clear and needs to be improved.
I am quite convinced of the usefulness of the tool. Section 5 provides good arguments for the effectiveness of the control mechanisms for generating datasets with desired features. As a small remark, Section 5 does not present a proper usability evaluation, so I suggest changing its title. What is missing, in my opinion, is an equally convincing argument for the usefulness and representativeness of the generated datasets.
I can better explain these concerns with a set of questions that the authors should address in the new version of the paper (in case of major revision).
• Have datasets generated with EMBench and EMBench++ been used to evaluate some entity resolution approach so far? If yes, to what extent?
• How do the data generated with the tool relate to other datasets used to evaluate entity resolution approaches? The relation to OAEI is quite clear, but other datasets have been used as well, e.g., DBpedia and LinkedGeoData [1], or the ACM-DBLP, Amazon-Google and Abt-Buy datasets used in [2]. Can you generate datasets of the same size as the datasets used so far? Is there an upper bound on the size of the datasets that you can generate? These questions become especially important in relation to datasets used to evaluate approaches for collective matching, entity evolution (a missing reference for this body of work is temporal record linkage [3]), and blocking-based methods. In other words, arguments should be given (after the description of the tool, for example in a subsection of Section 5) to convince the reader that the generated datasets have the features required to test these approaches. To this end, you may compare the datasets that EMBench++ can create to the datasets used in these domains (e.g., with a table that lists the datasets used so far for evaluation and their features: synthetic vs. non-synthetic, size, etc.), or sum up the main challenges identified for these approaches in the state-of-the-art and discuss how well these challenges are represented in datasets built with EMBench++.
• Will the tool be publicly available? (I consider this as an important factor for acceptance)
For the above-mentioned reasons, and despite the merit of the work described in the paper, I think that the paper needs to be revised before acceptance. I add more detailed comments below the references.
[1] Kevin Dreßler, Axel-Cyrille Ngonga Ngomo: On the efficient execution of bounded Jaro-Winkler distances. Semantic Web 8(2): 185-196 (2017)
[2] Axel-Cyrille Ngonga Ngomo, Mohamed Ahmed Sherif, Klaus Lyko: Unsupervised Link Discovery through Knowledge Base Repair. ESWC 2014: 380-394
[3] Pei Li, Xin Luna Dong, Andrea Maurino, Divesh Srivastava: Linking Temporal Records. PVLDB 4(11): 956-967 (2011)
DETAILED COMMENTS
P. 2
The benefit of using synthetic data … → Yes, but real-world data may exhibit heterogeneities found in real matching problems that synthetic data do not capture. I suggest discussing the pros and cons of synthetic datasets.
P. 4
It took me a while to understand whether derived column tables again contain one column each (resulting from rules applied to column tables) or contain more columns. I suggest stating this more explicitly.
"for for the attribute values" → for the attribute values
"These sources is" → These sources are
P. 5
Figure 2: Do you support URIs as identifiers? If yes, do you support URI generation schemes (e.g., www.example.org./person/Noela_Kulgen)? It may be helpful for benchmarking entity resolution for RDF.
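To clarify what I mean by a URI generation scheme, here is a minimal sketch (the function name, base URI, and pattern are my own illustration, not taken from the paper):

```python
def person_uri(base, given, family):
    """Build an entity identifier from a name, e.g. for RDF benchmark output."""
    local = f"{given}_{family}".replace(" ", "_")
    return f"{base}/person/{local}"

print(person_uri("http://www.example.org", "Noela", "Kulgen"))
# http://www.example.org/person/Noela_Kulgen
```

Supporting a configurable pattern of this kind would let users benchmark RDF-oriented entity resolution tools directly on the generated data.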
“or the same set but different level of destruction. “ → please, revise “levels of destruction”.
“Advance Generation of Benchmark Data “ → Advanced Generation of Benchmark Data (?)
“either data among the different entities or the heterogeneity level. “ → please, revise.
“incorporating a particular type of “ → incorporating particular types of
“I_m=f_m( f_m−1(... f(I))). “ → shouldn’t it be I_m = f_m(f_{m−1}(... f_1(I)))?
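To make the intended reading concrete, the corrected formula is an ordinary left-to-right composition of the m modifier functions, which could be sketched as follows (the toy modifiers are my own stand-ins for the paper's value transformations):

```python
from functools import reduce

def compose_modifiers(modifiers, entity_set):
    """Apply f_1 first, then f_2, ..., up to f_m: I_m = f_m(f_{m-1}(... f_1(I)))."""
    return reduce(lambda acc, f: f(acc), modifiers, entity_set)

# Toy modifiers standing in for value transformations such as those of EMBench++:
to_upper = lambda values: {v.upper() for v in values}
truncate = lambda values: {v[:3] for v in values}

print(compose_modifiers([to_upper, truncate], {"noela", "kulgen"}))
# {'NOE', 'KUL'}
```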
I suggest revising the explanation of Definition 2. It is a very intuitive concept (and definition) if the intuition that each modified entity set contains a subset of unmodified entities along with a set of modified entities is explained before the formal definition (e.g., using the explanations in the paragraph “EMBench++ uses c to split […]”).
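For instance, that intuition could be conveyed with a small sketch before the formal definition; the following is my own reading of the mechanism (interpreting c as the fraction of entities left unmodified, which the authors should confirm):

```python
def split_and_modify(entities, c, modify):
    """Keep a fraction c of the entities unmodified; apply `modify` to the rest."""
    k = int(len(entities) * c)
    clean, to_modify = entities[:k], entities[k:]
    return clean + [modify(e) for e in to_modify]

out = split_and_modify(["ann", "bob", "carol", "dave"], 0.5, str.upper)
print(out)  # ['ann', 'bob', 'CAROL', 'DAVE']
```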
P. 6
“related to its ability of the approach to “ → related to its ability to
“ensuring foreign keys agree “ → ensuring that foreign keys agree
“using attributes that are other entity types “ → using attributes that have entities of other types as values (?)
Use listings for pieces of code, add captions, and reference them in the text (as is done for Figures).
P. 7
Fig. 3: the notation for F_a appears to differ between C-A1 and C-A2 (and similarly between C-B1 and C-B2). Harmonize the notation in the figure.
“and keep incrementally include entities” → please, revise.
The adjust function in Definition 4 should be explained better. I suggest also using an example.
“we first apply the operators over F_a and generated F_b” → we first apply the operators over F_a and generate F_b (?)
“C-A is a independent” → C-A is an independent
P. 8
I like Figure 4, but all mechanisms should be exemplified (listings are helpful, but less so than examples of transformations).
“four investigated characteristic” → four investigated characteristics
P. 9
The plots and the tables in Fig. 5 should be explained in more detail, as should the concepts needed to understand them. For example, the percentage of clean vs. duplicated entities is not clear (what is the total behind the 5.6%?). It is also not clear why there are always 12.000 entities in Collection B. In Collection C, I guess it should be 90000 instead of 9000. The semantics of the circles in the plots is also unclear, because the centers of the circles seem to represent points in the plane. I guess you want to represent the size of the datasets, but size already seems to be represented by the x axis.
“sets have everything the same except the number of attribute entities.” → Please revise “sets have everything the same”; also, I think it should be “number of entity attributes”.
“data sets those entities evolved in time.“ → please, revise.