Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
The submission presents results for two related categorization tasks: whether two products advertised by different Web vendors (i.e. in potentially different ways) are the same or not, and under which category of a simple product hierarchy a given product can be subsumed. The main approach is to combine elements from supervised and unsupervised learning, “classical” classifier learning with neural-network/deep-learning approaches, and textual cues with image cues. The main result is that the addition of these latter techniques improves the quality of the results, as measured by standard evaluation measures.
The paper is an extension of a recent conference paper by the same authors. The additions over and above the conference paper are limited, but the increment is within the bounds of many journals’ requirements on what should be added to a conference paper to make it an acceptable journal paper.
ORIGINALITY:
In terms of content, the originality is limited. The product matching and categorization tasks themselves have been studied extensively for some years. There are real-world examples of the application on which this paper focusses, although they are admittedly far from perfect.
In terms of methods, the originality is limited. The paper applies known methods, albeit in a creative way that is well-suited to the particular problem (product matching and categorization for search goods) that the empirical evaluation tackles.
SIGNIFICANCE OF THE RESULTS:
The general task this paper deals with is wide and potentially interesting also for a wider audience.
However, the authors strongly delimit the task that they actually address. They look only at search goods, most of which are described by brand and type name, such that matching becomes near-trivial shallow linguistic processing: the letter-number combinations that identify a particular mobile phone etc. sometimes contain a hyphen and sometimes not, and the brand is sometimes named before and sometimes after the type. (This characterization is based, among other things, on the example shown on the WDC website, and I am aware that it is a simplification that will not apply to all pairs of products. Still, the matching task for these products is substantially simpler than that for other products.) The finding in the paper (p.10 bottom) that the product name field is the best feature for the matching supports this suspicion. And the question arises to what extent the finding that the results are better than with other methods depends on this restriction.
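To make this concern concrete, here is a minimal sketch of the kind of shallow normalization that already resolves the variations described above (hyphenation, brand/type ordering). The normalization rules are my own illustration, not the paper's method:

```python
import re

def normalize_title(title: str) -> frozenset:
    """Lowercase, drop hyphens and punctuation, and treat the title as an
    unordered bag of tokens, so brand-before vs. brand-after-type no longer
    matters."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower().replace("-", ""))
    return frozenset(cleaned.split())

def trivially_match(a: str, b: str) -> bool:
    """Two offers match iff their normalized token sets coincide."""
    return normalize_title(a) == normalize_title(b)

# Hyphenation and word order differ, yet the offers match:
print(trivially_match("Samsung Galaxy S-7", "Galaxy S7 Samsung"))  # True
```

A baseline of this kind, reported alongside the learned models, would show how much of the performance is attributable to the restriction to search goods.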
The finding about image features is potentially interesting, although it mostly confirms what one knows about the marketing of these goods in saturated markets: for example, that all smartphones look rather similar, in particular those by the same brand, and that product photography aesthetics are quite homogeneous, such that the main differentiation between brands is things such as the “look-and-feel” created by fonts, shapes, colour schemes, etc. Image features should be investigated in a broader view of the matching and categorization tasks.
It is unclear how the methods would behave for other products, in particular experience goods and credence goods. (I would expect word embeddings to work very well for these, since marketing speak needs to envelop customers in a discourse that plays on the to-be-expected experience and/or the to-be-given credence. - I have no idea how image features would play out for such goods.)
It is also unclear how scientists and practitioners not interested in the specific application considered here (product aggregators) can profit from the described methods and results. In other words: For what other questions are these methods and findings valuable, and why? The “use case” of Section 6 goes in this direction, although it still sits squarely in the advertising domain.
Most importantly, it is unclear when and how the proposed method produces errors, and what can be learned from this.
Thus, in general, to reach higher significance, the method should be tested in more diverse settings, the authors should strive not only for “better evaluation measures” but also for “interesting failures and in-depth error analyses”, more diverse application areas should at least be discussed, and a section on limitations should be added.
QUALITY OF WRITING:
The writing in general is good, although some typos and grammatical errors remain. Samples follow; please check carefully for further errors:
In the recent years → in recent years
empiric → empirical
Remove trailing closing brackets in the list in 3.1.
Last sentence on p.5 is not a sentence.
Within “p1.val(f)”, the formatting changes. This is extremely hard to read.
You talk about “significant differences” in values (e.g. on p.9), but I see no evidence of statistical tests or their results.
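For paired comparisons of two classifiers on the same test items, an exact McNemar test would be a natural candidate. A minimal self-contained sketch (my suggestion, not something the paper uses; the counts below are invented for illustration):

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the discordant counts:
    b = items only classifier A got right, c = items only classifier B
    got right. Under H0 the discordant items split 50/50, so the test
    is an exact binomial test with p = 0.5."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical counts: A alone correct on 15 items, B alone on 4.
print(round(mcnemar_exact_p(15, 4), 4))  # 0.0192
```

Reporting a p-value (or a confidence interval) of this kind would substantiate the claims of “significant differences”.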
The PCA description and purpose are unclear. How is the PCA set up, what components did you find, were these just used to produce Fig.4 or also as input for further analyses? How and why did you “select several attribute-value pairs”?
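To indicate the level of detail I am missing: a reproducible PCA description would at least state how the data were centred, how the components were computed, and how much variance the plotted components explain. A generic sketch of such a setup (entirely my own illustration with random data, not the authors' pipeline):

```python
import numpy as np

def pca_report(X: np.ndarray, n_components: int = 2):
    """Centre the data, compute principal components via SVD, and return
    the projected scores together with the explained-variance ratios of
    the kept components, i.e. the minimum a reader needs to interpret a
    PCA scatter plot such as Fig. 4."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = (S ** 2) / (S ** 2).sum()
    scores = Xc @ Vt[:n_components].T
    return scores, var_ratio[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # placeholder feature matrix
scores, ratio = pca_report(X)
print(scores.shape)                  # (100, 2)
```

Whether the components were only plotted or also fed into downstream analyses should be stated explicitly either way.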
The use case of Section 6 is interesting, but somewhat inconclusive. What would these results imply for an application? Would a vendor be happy to have an advertising intermediary add content to the product description that they provide? (Maybe there is a reason for not providing all the information. And what if the added content is faulty? What if it makes the user go to the product aggregator, or even to a competitor?) What is the “viewing pipeline” that you assume a web user / potential customer goes through?