From hyperlinks to Semantic Web properties using Open Knowledge Extraction

Tracking #: 1195-2407

Authors: 
Valentina Presutti
Andrea Giovanni Nuzzolese
Sergio Consoli
Aldo Gangemi
Diego Reforgiato Recupero

Responsible editor: 
Guest Editors EKAW 2014 Schlobach Janowicz

Submission type: 
Full Paper
Abstract: 
Open information extraction approaches are useful but insufficient alone for populating the Web with machine readable information as their results are not directly linkable to, and immediately reusable from, other Linked Data sources. This work proposes a novel paradigm, named Open Knowledge Extraction, and its implementation (Legalo) that performs unsupervised, open domain, and abstractive knowledge extraction from text for producing directly usable machine readable information. The implemented method is based on the hypothesis that hyperlinks (either created by humans or knowledge extraction tools) provide a pragmatic trace of semantic relations between two entities, and that such semantic relations, their subjects and objects, can be revealed by processing their linguistic traces (i.e. the sentences that embed the hyperlinks) and formalised as Semantic Web triples and ontology axioms. Experimental evaluations conducted on validated text extracted from Wikipedia pages, with the help of crowdsourcing, confirm this hypothesis showing high performances. A demo is available at http://wit.istc.cnr.it/stlab-tools/legalo.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Yingjie Hu submitted on 11/Nov/2015
Suggestion:
Accept
Review Comment:

The authors have addressed my comments successfully. The current version is acceptable if the following minor issues can be revised:

1. Section 3.4: "These labels constitute the basis for automatically generating a set of RDF triples that can be used for semantically annotating the hyperlinks included in s, additionally these set of triples provides a (formalised) summary of s." This should be two sentences: "... included in s. Additionally, these set of triples..."

2. Page 10, in the lower right column: "The latter containing information about the disambiguated senses of the verbs, i.e. frame occurrences, used in s." should be "The latter contains information..."

3. The clarity of Figure 5 could be improved with higher dpi.

4. The font sizes of the tables should be consistent. For example, the font size of Table 4 is much larger than that of Tables 2 and 6.

Review #2
By Marieke van Erp submitted on 20/Nov/2015
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This manuscript is a revised version of a previously submitted paper. The manuscript has much improved since the previous version, but there are still a few things that could be improved.

In Section 1, the paper states "In the Semantic Web era such factual relations should be expressed as RDF triples" and later on "Knowledge Extraction for the Semantic Web should instead include an abstractive step". I think these statements are too strong. There is a lot of information extraction happening in the natural language processing community that uses information from external resources such as VerbNet and DBpedia. These approaches often do not use RDF, but this does not mean that their results are not useful. Similarly, there are perfectly good reasons not to include an abstractive step in knowledge extraction, if one wants to stay as close to the original text as possible. I think it would be better if these statements were rephrased more modestly.

Section 1 also gives the impression that the abstractive results normalise extracted relationships as well (e.g. in the example "John Stigall received a BA" or "John Stigall obtained a BA", the extracted relationships would be normalised to the same relationship expression), or that there is a strong preference to do so. However, from Section 3.4 I gather that the normalisation (called reconciliation here) is left to future work. The authors mention WordNet in the paper, but they do not seem to use it. Have any experiments been carried out? This is something that is currently done by other researchers, and for example links to VerbNet and FrameNet can be obtained fairly easily too. See for example the Predicate Matrix, developed by researchers in the Basque Country: http://adimen.si.ehu.es/web/PredicateMatrix
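
To illustrate the kind of normalisation discussed above, the following is a minimal sketch in Python. It assumes a small hand-written synonym table (CANONICAL_VERB, hypothetical; it is not taken from the paper, WordNet, or the Predicate Matrix) and a hypothetical helper normalise_triple. It is not Legalo's method, only an example of mapping two verb forms onto one relation.

```python
from typing import Dict, Tuple

# A triple is (subject, verb lemma, object); this shape is an assumption
# made for illustration only.
Triple = Tuple[str, str, str]

# Hypothetical synonym table: each verb lemma maps to one canonical lemma.
# A real system might derive such a mapping from a lexical resource.
CANONICAL_VERB: Dict[str, str] = {
    "receive": "receive",
    "obtain": "receive",
}


def normalise_triple(triple: Triple) -> Triple:
    """Replace the verb lemma with its canonical form, if one is known."""
    subject, verb, obj = triple
    lemma = verb.lower()
    # Unknown verbs are kept as they are (lowercased).
    canonical = CANONICAL_VERB.get(lemma, lemma)
    return (subject, canonical, obj)


a = normalise_triple(("John Stigall", "receive", "BA"))
b = normalise_triple(("John Stigall", "obtain", "BA"))
```

Under these assumptions, both example sentences yield the same triple, which is the behaviour Section 1 appears to suggest.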

Also regarding the resources used, this version of the paper gives some motivation for why certain resources were chosen, but it is not entirely clear to me why VerbNet was preferred over FrameNet. VerbNet 3.2 describes 8,537 verbs, while FrameNet describes 13,370 lexical units (also including nouns). Then there is also PropBank, which contains more generalised frames but could also be useful here. Many SRL systems based on PropBank can also deal with passive forms, which FRED currently cannot. Also, the motivation for WiBi doesn't really explain why it is better suited than DBpedia or YAGO. What kind of empirical test was conducted? How was this evaluated?

In summary, I think the paper could still use some polishing up with respect to the phrasing of its contributions and relation to background work and motivation for particular design choices.

Some suggestions:
- You may want to have a look at http://groundedannotationframework.org/ for representing the link between the textual source and the generated triples
- To keep track of where a property or triple originated you may want to look into using http://www.w3.org/TR/prov-o/
- To scale up testing, could the LOD Laundromat be used instead of LOV?
- For related work, you may want to look at the NewsReader project, which is also doing Open IE and generating a large knowledge base from the extracted information [disclaimer: I work on this project]

Textual remarks:
Section 1:
In the Semantic Web era such factual -> In the Semantic Web era, such factual
directly resolved on existing -> directly resolved through existing
Section 2 and Section 3:
please remove the parentheses around the OKE capabilities and Legalo six main steps lists. You may find the 'description' environment in LaTeX useful here.
Section 3:
However an example may be useful -> However, an example may be useful
Legalo design strategy -> The Legalo design strategy
Example 3.1 seems a bit weird to me: "Joey Foster Ellis has published on The New York Times, and The Wall Street Journal". What does this mean? From Joey Foster Ellis' Wikipedia page, I gather that his work was featured in the NYT and WSJ, not that he wrote an article or did a piece about these newspapers.
Example 3.2 red:Elton_John -> fred:Elton_John
Section 4:
a Levenshtein metrics [32] -> a Levenshtein distance [32]
did not perform neither the relevant relation assessment nor -> did not perform the relevant relation assessment or
Section 5:
sentence s is an evidence of -> sentence s provides evidence for (2x)
the need of -> the need for
I'm not sure what the added value of the Task Unit IDs is in Table 3
Legalo ability to assess -> Legalo's ability to assess
Legalo performance is measured -> Legalo's performance is measured
that Legalo method -> that the Legalo method
Legalo ability of generating usable -> Legalo's ability to generate usable
Being paraphrasing a highly -> As paraphrasing is a
Caption under table 4 can be wider
saying that Legalo design strategy -> saying that the Legalo design strategy
are the Pnew and Psw definitions really necessary in 5.1 and table 6? These make the text less readable.
Section 7:
serious limits of course -> remove 'of course'
Section 8:
lLegalo -> Legalo
Legalo are envisaged -> Legalo are envisioned

Throughout the text:
DBPedia -> DBpedia
Please be consistent in spelling, sometimes you use American spelling (e.g. modeling) but mostly you use British spelling (e.g. modelling)

