A Machine Learning Approach for Product Matching and Categorization

Submitted by Petar Ristoski on 02/21/2017 - 15:30

Tracking #: 1571-2783

A new version of this paper is available

Authors:

Petar Ristoski

Petar Petrovski

Peter Mika

Heiko Paulheim

Responsible editor:

Claudia d'Amato

Submission type:

Full Paper

Abstract:

Consumers today have the option to purchase products from thousands of e-shops. However, the completeness of the product specifications and the taxonomies used for organizing the products differ across different e-shops. To improve the consumer experience, e.g., by allowing for easily comparing offers by different vendors, approaches for product integration on the Web are needed. In this paper, we present an approach that leverages neural language models and deep learning techniques in combination with standard classification approaches for product matching and categorization. In our approach we use structured product data as supervision for training feature extraction models able to extract attribute-value pairs from textual product descriptions. To minimize the need for lots of data for supervision, we use neural language models to produce word embeddings from large quantities of publicly available product data marked up with Microdata, which boost the performance of the feature extraction model, thus leading to better product matching and categorization performances. Furthermore, we use a deep Convolutional Neural Network to produce image embeddings from product images, which further improve the results on both tasks.

Full PDF Version:

swj1571.pdf

Revised Version:

A Machine Learning Approach for Product Matching and Categorization

Previous Version:

A Machine Learning Approach for Product Matching and Categorization

Tags:

Reviewed

Decision/Status:

Minor Revision

Solicited Reviews:

Review #1

By Kristian Kersting submitted on 26/Mar/2017

Suggestion:
Major Revision

Review Comment:

This is a resubmission. Therefore I will only focus on the issues I raised in my previous reviews:

(1) Some aspects of the experimental protocol are unclear: which significance test was used for evaluation, and was cross-validation used across all experiments?
(2) Missing related work, in particular on existing deep learning approaches for the task at hand.
(3) Justification of the deep architecture used.
(4) Reducing the number of times the word “deep” is used.

The authors have placed a general reference to deep learning and have included several additional references, going from originally 20 up to 39. In particular, they have related the present paper to several of the references provided in my first review. Thanks! So, (2) has been nicely addressed. The authors also now justify the CNN architecture used by describing it in more detail and providing a reference. As the present paper is less about the particular architecture, this is more than fine. So (3) has been addressed. Thanks! Also, the authors switch from “deep” to “neural” on several occasions, which I agree is the better alternative. Thanks! So, (4) has been addressed, although it is not clear to me why a McNemar test was used and not some t-test. The McNemar test is for nominal values (as far as I know), and given that we are interested here in accuracy etc., a t-test would have been more natural, in my opinion. I guess the authors considered the win-loss ratio, which is fine. Overall, I would accept the test, but it would be nice to have an explanation here. I leave it to the editor to decide whether this is important and, if so, to check the revision.
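
To make the distinction concrete: McNemar's test operates on the paired correct/incorrect outcomes of two classifiers on the same test items, using only the items on which they disagree, whereas a paired t-test would compare numeric scores such as per-fold accuracies. The following is a minimal sketch of the exact (binomial) form of McNemar's test. It is not the authors' evaluation code; the function name `mcnemar_exact` and the toy labels are assumptions made purely for illustration.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items classifier A got right
    and B got wrong, and c counts the reverse. Items on which both
    classifiers agree (both right or both wrong) do not enter the test.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        # The classifiers never disagree, so there is no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis the discordant items split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): A is right on all ten items, B on the first four only.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [1] * 4 + [0] * 6
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

The six discordant items all favour classifier A, which corresponds to the win-loss view mentioned above; the test says nothing about the size of the accuracy gap, only about whether the disagreements are lopsided.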

The only remaining downside, as far as I see, is the still unclear protocol for the first experiment, the evaluation of the CRF features. As the authors do not touch upon the cross-validation setting, I read this part again. I noticed that the neural embeddings are learned on the complete dataset. This is unfair, as the embCRF now has more knowledge of the dataset than the standard one. In turn, the results in Table 3 have to be better justified. Is it because of using the full training set for embCRF? This has to be clarified before publication. As the same features have been used in all other experiments, the other experiments should be checked, too. Or, as asked for in my previous review (sorry if this was not clear), the authors should justify the experimental setup. Generally, a significance test should be run everywhere, but I leave this decision to the editor. So, contrary to what the authors argue, not all classifiers have been trained on the same data (at least potentially).
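
The concern about features learned on the complete dataset can be illustrated with a small sketch. It assumes a k-fold protocol and uses a plain token vocabulary as a stand-in for an embedding model; the helper names and the toy product descriptions are hypothetical, and nothing here reproduces the paper's actual setup. The point is only that a feature model fitted on all data has already seen the test fold, while one fitted inside each fold has not.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for a simple interleaved k-fold split."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def fit_vocabulary(texts):
    """Stand-in for learning embeddings: record which tokens were seen."""
    return {tok for text in texts for tok in text.lower().split()}


def unseen_token_rate(vocab, texts):
    """Fraction of tokens in `texts` that the feature model has never seen."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return sum(tok not in vocab for tok in tokens) / len(tokens)


def compare_protocols(descriptions, k):
    """Return per-fold (per_fold_rate, full_data_rate) of unseen test tokens."""
    vocab_all = fit_vocabulary(descriptions)  # fitted once on everything: has seen the test folds
    results = []
    for train_idx, test_idx in k_fold_indices(len(descriptions), k):
        train_texts = [descriptions[i] for i in train_idx]
        test_texts = [descriptions[i] for i in test_idx]
        vocab_fold = fit_vocabulary(train_texts)  # fitted on the training fold only
        results.append((unseen_token_rate(vocab_fold, test_texts),
                        unseen_token_rate(vocab_all, test_texts)))
    return results


# Hypothetical product titles, for illustration only.
descriptions = [
    "apple iphone 6 16gb", "samsung galaxy s7 32gb", "apple ipad air 64gb",
    "sony xperia z5 32gb", "samsung galaxy tab 16gb", "lg g5 32gb",
]
for fold_rate, full_rate in compare_protocols(descriptions, k=3):
    print(f"fitted per fold: {fold_rate:.2f} unseen, fitted on all data: {full_rate:.2f} unseen")
```

A model fitted on all data never encounters an unseen test token, which is the extra knowledge of the dataset at issue. Fitting the feature model inside each fold, or fitting it on a separate corpus that both compared systems may use, would remove that asymmetry.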

To summarise, (2)-(4) have been addressed well. Thanks! (1) has been addressed partly and has raised a new, more refined issue.

Review #2

Anonymous submitted on 20/Apr/2017

Suggestion:
Accept

Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

All the requests and issues I had raised have been addressed.

Review #3

By Bettina Berendt submitted on 25/Apr/2017

Suggestion:
Minor Revision

Review Comment:

The revision is clearer than the first version, and the authors have addressed most of my concerns. However, the request to add an error discussion and limitations section was only addressed in a very superficial way, and these are key components of good scientific work.

The error discussion is buried somewhere in the paper; it is very short and sloppily described. Clearly, the authors regard the achievements of their method as much more noteworthy and deserving of detailed description than the errors. This is a misunderstanding of the role of errors: they guide you towards a better understanding of your method and ultimately towards scientific progress. Error discussions are also important to readers in order to allow them to strive for scientific progress.

There is also some handwaving in the contents of the error discussion. For example, is the fact that different products are described together really a "data error"? Maybe it is just the way some websites organise their contents, and they do this for a reason?

A limitations section is missing, but needed. (All methods have limitations. Again, one can learn from them.)

The future work section therefore remains quite shallow and consists mainly of claims that the method can be applied to all kinds of other purposes.

Please provide these missing elements.

Minor issues: Please proofread carefully. Especially towards the end, there are a number of grammatical mistakes and stray words.
