Literally Better: Analyzing and Improving the Quality of Literals

Submitted by Wouter Beek on 02/27/2017 - 02:23

Tracking #: 1579-2791

Authors:

Wouter Beek

Filip Ilievski

Jeremy Debattista

Stefan Schlobach

Jan Wielemaker

Responsible editor:

Guest Editors Quality Management of Semantic Web Assets

Submission type:

Full Paper

Abstract:

Quality is a complicated and multifarious topic in contemporary Linked Data research. The aspect of literal quality in particular has not yet been rigorously studied. Nevertheless, analyzing and improving the quality of literals is important since literals form a substantial (one in seven statements) and crucial part of the Semantic Web. Specifically, literals allow infinite value spaces to be expressed and they provide the linguistic entry point to the LOD Cloud. We present a toolchain that builds on the LOD Laundromat data cleaning and republishing infrastructure and that allows us to analyze the quality of literals on a very large scale, using a collection of quality criteria we specify in a systematic way. We illustrate the viability of our approach by lifting out two particular aspects in which the current LOD Cloud can be immediately improved by automated means: value canonization and language tagging. Since not all quality aspects can be addressed algorithmically, we also give an overview of other problems that can be used to guide future endeavors in tooling, training, and best practice formulation.

Full PDF Version:

swj1579.pdf

Previous Version:

Literally Better: Analyzing and Improving the Quality of Literals

Tags:

Reviewed

Decision/Status:

Solicited Reviews:

Review #1

Anonymous submitted on 13/Mar/2017

Suggestion:
Minor Revision

Review Comment:

The authors have addressed most of the comments included in the previous review and have provided explanations. The taxonomy is improved with the clear separation of the inherent data-specific quality aspects from the RDF processor related aspects such as “Unimplemented”. Readability of most sections of the paper is improved in the new version and the new examples included in the paper help to understand its content better.

Comments:

* The DQV work has been completed, so the corresponding text can be updated accordingly.
* For completeness and consistency, it might be better to add descriptions for the categories “well-specified” and “canonical” for datatyped literals and “consistent” and “well-specified” for language-tagged strings. For instance, these could be one-phrase descriptions similar to those for the “valid” or “registered” categories.
* Because “unsupported” is not a category in the taxonomy, it could be removed from the text where it says “Invalid or unsupported”, for clarity.
* According to Section 4.1, the example “semantics”^^xsd:string does not fit “underspecified” in the language-tagged string tree, does it? Doesn’t it belong to the datatyped literals tree?
* The last paragraph of Section 6.3 is a bit unclear.
* When giving statistics about strings on page 14, it would be interesting to include what percentage of the 2.26 billion language-tagged strings had the optional language tag.
* In general, the sections about quality assessment (e.g., Sections 6.2 and 6.3) still have a lot of room for improvement with respect to clarity and detail.

Previous comments which are still relevant:
* For single-word literals, aren’t there more efficient ways of doing dictionary look-ups than looking for an rdfs:seeAlso property in the resources returned by the Lexvo API? For instance, if I look for a word such as http://www.lexvo.org/page/term/eng/Canonicalization, your approach will give a false negative. Do you know the precision and recall of this approach? It also looks strange that the ALD libraries, which the paper describes as having high precision, are not used here. It would be good to motivate why the current approach was chosen.

Minor issues:
* AbstractQuality -> Abstract Quality (page 1)
* calculating the ration -> calculating the ratio (page 12)
* a these are supported -> these are supported (page 15)

Review #2

By Heiko Paulheim submitted on 23/Mar/2017

Suggestion:
Minor Revision

Review Comment:

I appreciate the thorough review made by the authors, and the answer provided.

My main remaining concern is the identification of datasets. I can see that this is a tricky issue, which cannot be fully resolved, but some heuristics could be applied, e.g.,
1) using the pay level domain as a proxy for a dataset
2) using datasets as provided in the LOD cloud dataset, and falling back to (1) if there are none
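
Heuristic (1) could be sketched roughly as follows. This is only an illustration under stated assumptions, not a description of the authors' toolchain: the function names and the small suffix set are hypothetical, and a real implementation would consult the full Public Suffix List rather than a hard-coded approximation.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny set of multi-label public suffixes.
# Assumption: a real implementation would use the full Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "ac.uk", "org.uk", "com.au"}


def pay_level_domain(iri: str) -> str:
    """Approximate the pay-level domain of the host part of an IRI."""
    host = (urlsplit(iri).hostname or "").lower()
    labels = host.split(".")
    if len(labels) <= 2:
        return host
    last_two = ".".join(labels[-2:])
    # If the last two labels form a known public suffix, keep three labels.
    if last_two in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return last_two


def group_by_dataset(iris):
    """Count IRIs per approximate dataset, using the pay-level domain as proxy."""
    return Counter(pay_level_domain(iri) for iri in iris)
```

Applied to the undefined datatype IRIs, counts grouped this way would show whether one domain dominates the findings.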

The reason I am so picky about this is my original remark about the statement made in Section 7.1 (i.e., "defining the DBpedia type IRIs would solve the vast majority undefined datatype IRIs"). Without a clearer breakdown of the findings, it is unclear whether this finding stems from (a) a heavy bias of the evaluation set towards DBpedia or (b) the fact that this problem only occurs in DBpedia.

Further comments of my initial review that have remained unaddressed:
* a justification of why documents with a score of less than 40% have been chosen for manual inspection in Section 6.3
* a statement on whether the distribution of omitted language tags is the same as or different from the distribution of explicitly given language tags

However, I am confident that those issues can be fixed.
