Abstract:
We provide a comprehensive survey of the research literature that applies Information Extraction techniques in a Semantic Web setting. Works in the intersection of these two areas can be seen from two overlapping perspectives: using Semantic Web resources (languages/ontologies/knowledge-bases/tools) to improve Information Extraction, and/or using Information Extraction to populate the Semantic Web. In more detail, we focus on the extraction and linking of three elements: entities, concepts and relations. Extraction involves identifying (textual) mentions referring to such elements in a given unstructured or semi-structured input source. Linking involves associating each such mention with an appropriate disambiguated identifier referring to the same element in a Semantic Web knowledge-base (or ontology), in some cases creating a new identifier where necessary. With respect to entities, works involving (Named) Entity Recognition, Entity Disambiguation, Entity Linking, etc. in the context of the Semantic Web are considered. With respect to concepts, works involving Terminology Extraction, Keyword Extraction, Topic Modeling, Topic Labeling, etc., in the context of the Semantic Web are considered. Finally, with respect to relations, works involving Relation Extraction in the context of the Semantic Web are considered. The focus of the majority of the survey is on works applied to unstructured sources (text in natural language); however, we also provide an overview of works that develop custom techniques adapted for semi-structured inputs, namely markup documents and web tables.
Review
This is a well-written survey of information extraction techniques that involve the semantic web. The survey is clearly organized and very comprehensive.
One challenge of a survey like this, where relevant work has been done by researchers from many different communities, each with their own terminology, is to group and label the subject matter in a meaningful and accessible way. The authors’ groupings on the whole make sense (though there were a couple of distinctions that I’m not sure are important from a practical perspective), and they use footnotes to list common synonyms for each major grouping’s label.
Each major section contains a list of currently outstanding research questions in the sub-field discussed in that section. These tend to be somewhat repetitive across the different sub-fields; however, they do highlight issues that I frequently find are overlooked, such as the need for more fine-grained evaluation approaches that avoid treating techniques as monolithic black boxes, and the need to evaluate runtime and memory usage in addition to traditional performance metrics such as F-measure.
The discussion of overall trends in the field near the end of the paper is well done and very useful.
The paper contains an appendix that overviews traditional NLP and information extraction techniques, allowing it to stand alone for its intended audience (Semantic Web researchers and practitioners). I originally thought the references given for more comprehensive surveys of this material were somewhat dated, but in looking further myself I couldn’t find anything more current.
Of minor note, Table 2 indicates that 89 papers were surveyed, but the conclusion says, “In terms of the 109 highlighted papers in this survey…”, which seems inconsistent.
Response to Open Review by Michelle Cheatham
The authors thank Michelle for her comments. Indeed, integrating the terminology used in different papers from different areas was a significant challenge when preparing the survey, both in terms of finding papers, and ultimately in writing the text. We have tried to strike a good balance between being faithful to the terminology used in the literature while adopting a coherent, self-contained terminology in the survey text.
Regarding the open questions that we present, while we acknowledge there is repetitiveness across the sections (where issues like evaluation procedures, scalability, etc., appear in all sections), we ultimately decided that this was a necessary compromise to have an independent list of questions per section (especially since there are important nuances to these common themes that are specific to each section).
Regarding Table 2, these 89 papers include only those works accepting text as input (the main focus of the survey). A further 20 papers that consider semi-structured inputs are added later in Section 5. We will add a footnote to the text describing Table 2 to clarify this issue in the camera-ready version.