Review Comment:
I've reviewed the submission following the approach described in [1]:
Read the introduction.
Identify the BIG QUESTION.
→ How to generate comprehensive benchmark data for entity matching.
Summarize the background in five sentences or less.
→ Benchmarks based on both real and synthetic data already exist. Real data has obvious shortcomings, such as insufficient ground truth. For synthetic data, according to the authors, the existing generators offer users less control over modifications. They state that their approach also supports generating volatile data. Further, the authors discuss existing matching methods and indicate that current benchmarking systems only target similarity methods but fail to address collective matching, entity evolution, and large-scale data processing as in blocking methods.
I’m missing a discussion of, and comparison to, a few relevant recent contributions in this field, such as SPIMBENCH (https://www.ics.forth.gr/_publications/GraphTA15.pdf). Also, despite their focus on spatial data, the linking benchmarks of the HOBBIT project (as in OAEI 2017.5) are related; since they are very specific, however, I don’t object to omitting them from this discussion.
Identify the SPECIFIC QUESTION(S).
→ How can the existing EMBench system be extended to provide means for generating benchmarks that address entity matching on data sets with relations between entities, over collections of data sets with certain characteristics, and on volatile data?
Identify the approach.
→ The authors extend the system, defining the respective operations on modifications, attributes and values.
Read the methods section.
→ Apparently the generation can be configured in XML. Although the discussed modification options appear reasonable, there is no clear indication of why the authors decided on this particular set of modifications (derived from related work? requirements gathered somehow? etc.), or which additional modifications may be relevant for certain use cases. The authors mention a study on DBpedia regarding its volatility, which might include an analysis of common types of changes and common patterns like entity merge and split. However, this is not clearly discussed.
Read the results section.
→ The empirical evaluation lays out the advances over static benchmark collections very well. It goes on to discuss the properties of collections generated by EMBench++, also considering evolution and (to some degree) comparing them to real-world datasets. Finally, it briefly discusses an application of the framework in a recent experimental evaluation.
Do the results answer the SPECIFIC QUESTION(S)?
→ They answer these questions to some degree. While the authors have addressed the review comments of Reviewer 2 in their evaluation, my impression is that the added information covers reasonable but relatively straightforward aspects (e.g., a Zipfian distribution of names when this distribution is selected in the configuration (Fig. 9); evolved entities when a similar percentage has been configured (Fig. 10)). The following aspects of the evaluation are lacking:
1) Are the configurable modifications for the evolution (e.g., the modification functions) representative of the modifications found in real datasets? In this regard, the paper neither lists all the available modifications (it mentions volatility and misspelling in examples), nor does it examine the modifications in a real dataset, e.g., using a random sampling approach, to see whether the kinds of changes to these entities can be represented by the configuration options of EMBench++.
2) Is the framework scalable w.r.t. generating large datasets? This might be compared to, e.g., SPIMBENCH, which appears to provide runtime measurements for dataset sizes of up to 500M triples.
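As an aside on point 1) and the Fig. 9 discussion: a simple way to check whether generated name frequencies are in fact approximately Zipfian is to fit the slope of the log-log rank-frequency curve, which should be close to -s for a Zipf(s) distribution. The following is a generic sketch of such a check, not EMBench++ code; all function names are my own.

```python
import math
import random

def zipf_sample(n_items, s, n_draws, rng):
    """Draw n_draws item ranks from a Zipf(s) distribution over ranks 1..n_items."""
    weights = [1.0 / (r ** s) for r in range(1, n_items + 1)]
    return rng.choices(range(1, n_items + 1), weights=weights, k=n_draws)

def rank_frequency_slope(samples):
    """Least-squares slope of log(frequency) vs. log(rank); ~ -s for Zipfian data."""
    counts = {}
    for x in samples:
        counts[x] = counts.get(x, 0) + 1
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

rng = random.Random(42)
samples = zipf_sample(n_items=200, s=1.0, n_draws=50_000, rng=rng)
slope = rank_frequency_slope(samples)
print(round(slope, 2))  # expect a slope near -1.0 for s = 1.0
```

The same slope estimate, applied to the generated benchmark's attribute-value frequencies, would let readers verify the claimed distribution quantitatively rather than visually.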
Read the conclusion/discussion/Interpretation section.
→ The conclusion section might be extended with an outlook on potential future extensions.
Now, go back to the beginning and read the abstract. Does it match what the authors said in the paper? Does it fit with your interpretation of the paper?
→ Yes, the abstract is a very good representation of the paper.
General comments:
The paper describes a system called EMBench++, an extension of an existing system intended for generating benchmarks for entity matching. The authors work in a field of active and very relevant research, and contributions to benchmarking methods for linking entities across different datasets in a reproducible way are of great interest. The extension provides novel contributions to important issues in real applications, such as entity matching on evolving datasets.
Although the authors have clearly extended the paper in this regard, the evaluation still lacks aspects that would be necessary to assess whether the tool can actually generate benchmark datasets resembling real ones. In particular, while Section 4 provides a well-defined formalization of the approach, I think the authors could discuss in more detail which modifications are available (misspelling operators etc.) and how they identified those that are necessary (manual study of real datasets etc.). Likewise, a comparison of the generated modifications with modifications found in real datasets would be an interesting evaluation aspect. Last, the authors don't discuss runtime behaviour, i.e., the time and memory required for generating the benchmark datasets.
Quality of writing is generally good. However, the authors should fix several missing/duplicate words, typos or confusing sentences, including but not limited to:
- "has been modified overtime time"
- "moves towards a thorough experimental evaluations"
- "..., which (primarily) include other than entity matching and search, also entity evolution." -> please rewrite, it's hard to understand
- "produced with in a controlled" -> within
- "Provide mechanisms that allow..." -> "We provide..."
- "the thorough experimental evaluating matching-related" -> "evaluation of..."
- "a number of open challanged."
- "On 2004" -> "In 2004"
- "The particular benchmark, first includes entities" (punctuation)
- "to rerate to each other"
- "Ik-1 into two sets" (missing word?)
- "Our current implementation, supports identify function... them or with strings" (very confusing punctuation)
- "EMBench contains implementations for set of" -> "for a set of"
- "different options, refer to as propagation type" -> "referred to as"
- "In particularly," -> "In particular,"
Further remarks:
- Number formatting should be fixed (e.g. in Figure 7), such as 38,000 and 5.6%
- Section 5.4 mentions a VLDB 2018 publication (remove the comma after the citation), but the reference says 2017.
[1] https://violentmetaphors.com/2013/08/25/how-to-read-and-understand-a-sci...