Review Comment:
In this paper the authors present Orbis, a system and framework for evaluating and benchmarking information extraction pipelines on textual data, with a focus on content extraction, named entity recognition and linking, and slot filling. The paper addresses a relevant problem and concrete needs and use cases, and the overall work is sufficiently motivated and framed in the context of the state of the art. Overall, the paper is well structured and generally well written. The tool and related material are available on GitHub and meet all the requirements for the submission. However, I would have expected a publicly available instance of the system (at least for demo/illustrative purposes).
As a "full paper" (as this work is submitted), this work does not include strong research results and contributions. On the other side, as a "Reports on tools and systems" it would probably not yet meet the maturity and impact requirements. Nevertheless, the work has a good potential and can be improved through a careful revision.
This work clearly builds on and extends the authors' previous work. While a previous paper is only briefly mentioned in the submitted work (sec. 6.1, close to the end of the paper), I believe it should be explicitly mentioned in the introduction, clearly stating how the previous work was extended and what the new, original contributions of this submission are with respect to previous papers.
Although the related work presented in section 2 does not explicitly compare the mentioned tools with Orbis, such a comparison is provided in section 5.1. It would be useful for the reader to add in section 2 a pointer to that comparison (i.e., mention at the end of sec. 2 that a detailed comparison with other tools is provided in sec. 5.1). Otherwise, as it stands, the related work section reads as a description of other approaches without any hint or statement on how they compare to your framework.
The attempt made in section 3 to formalise and systematise the definitions of the four information extraction tasks and related metrics is valuable and provides the background needed to understand the challenges that Orbis aims to address. However, my overall impression is that the notation is "overused", in particular the subscripts/superscripts, which do not help readability and are not used consistently. Most of the formulae introduced in section 3 serve no purpose in the rest of the paper (they are neither used nor referenced elsewhere) or represent well-known metrics or measures (precision, recall, F1, etc.). I therefore suggest carefully reviewing section 3 (as well as any other part of the paper that may be affected) so that the notation is consistent and correct, also taking into account the following comments and suggestions.
* In all definitions and formulae, review the usage of the "i" subscript. In some cases it seems to refer to the i-th document (document d_i in the definition of CE; page 4, second column, line 11) and then appears in all subsequent "entities"; my impression is that you can simply refer to a document "d" and remove all those "i"s (see the sketch after this list for one possible index-free formulation). In section 3.2 basically everything carries an "i", so the meaning is unclear: either I misunderstand the symbols, or the same "i" cannot be used to index the string extracted from the document, the surface form of the entities, the entity type, and the variables for the start/end position. Similar observations hold for the definition in section 3.3.1, where "i" seems to refer to the i-th entity in a KG, but is also used for the surface form (in a document d, thus no longer d_i) and for the variables for the start/end position of the mention.
* Make explicit the intended meaning of other superscripts used: in section 3.1, it seems that "r" stands for "relevant" and "g" for "gold"; in section 3.3.6 my understanding is that "c" stands for "corpus" and "s" for "system" (although this is not clear when those symbols are first introduced).
* In formula (3), the denominator should be |T_i^r|
* Are formulae (6) and (7) really useful?
* Section 3.2 on named entity recognition does not go beyond the problem statement. What about evaluation metrics, as discussed for the other tasks? Are there any benchmarking issues or other challenges, as discussed for NEL?
* In general, where possible, providing an example for each of the definitions would be useful and would improve understandability.
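To make the notation suggestion above concrete, here is one possible index-free formulation of the content extraction metrics (the symbols are my own placeholders and not necessarily those of the paper: T^{g}(d) denotes the gold tokens of a single document d, and T^{r}(d) the relevant/system tokens returned for it):

\[
P(d) = \frac{|T^{g}(d) \cap T^{r}(d)|}{|T^{r}(d)|}, \qquad
R(d) = \frac{|T^{g}(d) \cap T^{r}(d)|}{|T^{g}(d)|}, \qquad
F_1(d) = \frac{2\,P(d)\,R(d)}{P(d) + R(d)}.
\]

Written in this form, the precision-style denominator is the system/relevant set (which is what I believe formula (3) intends), and no "i" subscript is needed as long as a single document is considered.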
The discussion of the architecture in section 4 should be reviewed and improved, perhaps starting from the authors' previous work ("Odoni et al., On the Importance of Drill-Down Analysis for Assessing Gold Standards and Named Entity Linking Performance, SEMANTiCS 2018 / Procedia Computer Science 137:33-42, 2018"). That work, for example, contains a clear architectural diagram, which is missing from this submission and would greatly improve the understandability of the system. In particular, section 4.1 mixes different perspectives (functional pipeline, implementation details, details of the visual interface) without any clear structure. For example: how do the two viewing modes (standard and dark, mentioned at page 8, first column, lines 17-18) relate to the pipeline? Why is this UI detail mentioned there? Similarly, after presenting the three stages, the possibility of installing external packages using Python is described, but the paper never even mentions that the system is implemented in Python. I suggest restructuring the section so that design choices, the architectural model (with a diagram) and implementation details are presented clearly and consistently.
In the discussion in section 5, the clear task-oriented structure (CE, NER, NEL, SF) of section 3 is somewhat lost. The focus seems to be mainly on NEL: error analysis and lenses basically target this task only. I understand from section 3 that the main challenges relate to NEL, but it is unclear if and how the system can concretely help in the case of CE, NER and SF, beyond providing side-by-side views of the gold standard and the annotators' results (which in itself is still valuable). Section 5.1 and Table 3 on the FAIR principles are not very convincing and additional details are needed. In particular:
- Which community-based, open vocabularies are used with respect to I2 in the "Interoperable" section (page 11, line 21)?
- What are the TAC standards mentioned for R1 in the "Reusable" section (line 24)?
- Concerning R1.3, I do not understand the meaning of "Covers a superset of domain-relevant data".
- I'm also a bit surprised to see JSON mentioned as a "formal, accessible, shared, and broadly applicable language for knowledge representation" (I1).
Having some IDs and using HTTP and JSON is not enough to truly meet the FAIR principles. Please carefully rethink your adherence to the FAIR principles and, if you believe you really implement them, provide a clear and convincing explanation for each of them.
Although evaluating a system like Orbis is not easy, the paper does not include any evaluation at all. The paper includes several statements on the added value of the system (e.g., "These tools aid experts in quickly identifying shortcomings within their methods and in addressing them."; "Orbis significantly lowers the effort required to perform drill-down analysis which in turn enable researchers to locate a problem in algorithms, machine learning components, gold standards and data sources more quickly, leading to a more efficient allocation of research efforts and developer resources."), but there is no evidence supporting them. Who has been using Orbis beyond the authors? Have you collected any feedback from those "experts"? Have you performed any kind of evaluation (even a simple System Usability Scale questionnaire)? The impact section (6.1) does not help in addressing these questions. While it is intuitively true that the system can help in addressing some issues, such strong yet vague statements ("significantly lowers the effort...", "quickly identifying...", "more efficient allocation...") should be supported by measurable criteria or other kinds of evidence. Do you at least have a plan for its uptake and systematic evaluation?
---
Typos and minor comments
- page 1, second column, line 44: the parenthesis before "e.g." is not closed (but it can be replaced by a comma)
- page 3, second column, line 39: "DS errors" - the abbreviation DS (for dataset) is used here but only introduced on page 10
- page 8, second column, line 49: "wrapper" --> "wrappers"
- Table 3: "Accesible" --> "Accessible"
- page 17, first column, line 47: the project's URL