Editorial Board

Editors-in-Chief
Krzysztof Janowicz

Managing Editors
Cogan Shimizu
Eva Blomqvist

Editorial Board
Mehwish Alam
Claudia d’Amato
Stefano Borgo
Boyan Brodaric
Philipp Cimiano
Oscar Corcho
Bernardo Cuenca-Grau
Elena Demidova
Jerome Euzenat
Mark Gahegan
Aldo Gangemi
Anna Lisa Gentile
Rafael Goncalves
Dagmar Gromann
Armin Haller
Aidan Hogan
Katja Hose
Eero Hyvönen
Sabrina Kirrane
Agnieszka Lawrynowicz
Freddy Lecue
Maria Maleshkova
Raghava Mutharaju
Axel Polleres
Guilin Qi
Marta Sabou
Harald Sack
Christoph Schlieder
Stefan Schlobach
Oshani Seneviratne
Cogan Shimizu
Ruben Verborgh
GQ Zhang

Former Editors-in-Chief
Pascal Hitzler

Editorial Assistants
Sanaz Saki Norouzi

Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms

Submitted by Carlos Badenes-... on 06/16/2019 - 02:41

Tracking #: 2239-3452

Authors:

Carlos Badenes-Olmedo

José Luís Redondo-García

Oscar Corcho

Responsible editor:

Guest Editors Semantic E-Science 2018

Submission type:

Full Paper

Abstract:

Searching for similar documents and exploring major themes covered across groups of documents are common actions when browsing collections of scientific papers. This manual, knowledge-intensive task may become less tedious and even lead to unforeseen relevant findings if unsupervised algorithms are applied to help researchers. Most text mining algorithms represent documents in a common feature space that abstracts away from the specific sequence of words used in them. Probabilistic Topic Models reduce that feature space by annotating documents with thematic information. Over this low-dimensional latent space some locality-sensitive hashing algorithms have been proposed to perform document similarity search. However, thematic information is hidden behind hash codes, preventing thematic exploration and limiting the explanatory capability of topics to justify content-based similarities. This paper presents a novel hashing algorithm based on approximate nearest-neighbor techniques that uses hierarchical sets of topics as hash codes. It not only performs efficient similarity searches, but also allows extending those queries with thematic restrictions explaining the similarity score from the most relevant topics. Extensive evaluations on both scientific and industrial text datasets validate the proposed algorithm in terms of accuracy and efficiency.
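The idea of using hierarchical sets of topics as hash codes can be illustrated with a minimal sketch. It is not the authors' algorithm: the threshold scheme, the cutoff values, the level-weighted Jaccard similarity, and the names `topic_hash`, `hash_similarity`, and `search` are all assumptions made for illustration. Each document's topic distribution is reduced to a tuple of topic sets ordered by relevance, so the hash code stays readable and a query can be restricted to documents whose most relevant set contains a given topic.

```python
from typing import Dict, FrozenSet, List, Optional, Sequence, Tuple

HashCode = Tuple[FrozenSet[int], ...]


def topic_hash(weights: Sequence[float],
               cutoffs: Sequence[float] = (0.5, 0.2)) -> HashCode:
    """Group topic indices into relevance levels (hypothetical scheme).

    A topic lands in the first level whose cutoff, taken as a fraction of
    the highest topic weight, it reaches. Topics below the last cutoff are
    left out of the hash code.
    """
    top = max(weights)
    levels = [set() for _ in cutoffs]
    for topic, weight in enumerate(weights):
        for level, cutoff in enumerate(cutoffs):
            if weight >= cutoff * top:
                levels[level].add(topic)
                break
    return tuple(frozenset(level) for level in levels)


def hash_similarity(a: HashCode, b: HashCode) -> float:
    """Jaccard overlap per level, with earlier levels weighted more (assumed)."""
    total, norm = 0.0, 0.0
    for level, (set_a, set_b) in enumerate(zip(a, b)):
        union = set_a | set_b
        if not union:
            continue  # both sets empty: this level carries no evidence
        level_weight = 1.0 / (level + 1)
        total += level_weight * len(set_a & set_b) / len(union)
        norm += level_weight
    return total / norm if norm else 0.0


def search(query: HashCode, index: Dict[str, HashCode],
           must_include: Optional[int] = None) -> List[Tuple[float, str]]:
    """Rank documents by similarity, optionally requiring a top-level topic."""
    results = []
    for doc_id, code in index.items():
        if must_include is not None and must_include not in code[0]:
            continue  # thematic restriction on the most relevant topics
        results.append((hash_similarity(query, code), doc_id))
    return sorted(results, reverse=True)


# Toy topic distributions over four topics (made-up values).
index = {
    "d1": topic_hash([0.70, 0.20, 0.05, 0.05]),
    "d2": topic_hash([0.65, 0.05, 0.25, 0.05]),
    "d3": topic_hash([0.05, 0.10, 0.05, 0.80]),
}
query_code = topic_hash([0.68, 0.22, 0.06, 0.04])
ranked = search(query_code, index)
restricted = search(query_code, index, must_include=0)
```

In this toy setting, the topics shared at each level (for example `query_code[0] & index["d1"][0]`) serve as a simple explanation of why two documents are considered similar. This linear scan only shows the data structure; an efficient index would look documents up by their topic sets instead of comparing against every code.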

Full PDF Version:

swj2239.pdf

Previous Version:

Efficient Exploration of Scientific Articles using Topic-based Hashing Algorithms

Tags:

Reviewed

Decision/Status:

Solicited Reviews:

Review #1

By Anita de Waard submitted on 25/Jun/2019

Suggestion:
Accept

Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Review #2

Anonymous submitted on 08/Jul/2019

Suggestion:
Accept

Review Comment:

Compared to the previous version, reproducibility is improved because more implementation details have been added, including the libraries used for the experiments.

Some new figures show that the precision obtained by the algorithm is indeed robust to the dimension (number of topics). In future work, the authors may also try a non-parametric version of LDA, which infers the number of topics automatically.

Although I agree that the proposed method offers some nice properties that other methods lack, more comparisons between different algorithms could show that the proposed method at least does not sacrifice performance in other aspects such as precision.

In general, the authors have addressed most of the questions I raised in the last review.

Review #3

By Daniel Garijo submitted on 14/Sep/2019

Suggestion:
Accept

Review Comment:

The authors have thoroughly answered all my comments, and I think the paper should be accepted as part of the journal.

I leave just a few small comments that would be good to address in the final version of the paper:

- I recommend another proofread, because some of the newly introduced changes contain typos. For example, "Trained in corpora with different parameter" --> "parameters".
- Try to avoid using "very" so often. For instance, instead of "very difficult" you could use "challenging". There are other adjectives that help you quantify your text :)
- In the second paragraph of the intro, remove "Therefore". I think this work is not a consequence of the problem, but a contribution to address it.
- Footnote 4 should become a release, if possible with its corresponding citation (e.g., Zenodo). I ask because the dataset and pre-trained materials may change in the future, whereas the ones produced for this paper are concrete.
- Finally, please try to reduce the number of acronyms used in the paper. It is hard to follow, especially when some of them overlap with other acronyms in the state of the art (e.g., ANN also stands for Artificial Neural Networks).
