LinDA - Linked Data for SMEs

Tracking #: 998-2209

Authors: 
Spiros Mouzakitis
Judie Attard
Robert Danitz
Lena Farid
Eleni Fotopoulou
Michael Galkin
Barbara Kapourani
Fabrizio Orlandi
Dimitris Papaspyros
Michael Petychakis
Andreas Schramm
Anastasios Zafeiropoulos
Norma Zanetti

Responsible editor: 
Oscar Corcho

Submission type: 
Tool/System Report
Abstract: 
Linked Data is currently an active research field; new ideas, concepts, and tools keep emerging in quick succession. In contrast, relatively little activity is seen on aspects like ease of use and accessibility of tools for non-experts, which would promote Linked Data at the SME level. This is a sign of the still-developing maturity of said research field, and it is the motivation and starting point of the LinDA (“Linked DAta”) project. Concepts and components of an integrated framework of tools will be presented, covering both Linked Data provision and consumption. It will be shown how these tools can be combined and employed to carry out various activities, whose scopes range from single actions to holistic workflows. The usability of these workflows will be demonstrated through concrete pilot application scenarios.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Vanessa Lopez submitted on 06/Mar/2015
Suggestion:
Reject
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

---------

This paper describes the LinDA project, which aims to address the difficulties that non-expert users in SMEs face when trying to use Linked Data.

The topic of the paper is of interest and highly relevant to the community. The authors rightly point out that the use of Linked Data in practice is hampered by various issues:
1) lack of workflows and convincing use cases,
2) scarcity of public data sources and useful vocabularies,
3) the dynamicity and number of different tools,
4) initial costs and lack of user friendliness.

The authors state that their approach aims to address these difficulties; however, they should mention more explicitly what the actual contributions are.

While I am not too convinced by argument 3), I do strongly agree with the importance of 1) and 4). Nonetheless, there is extensive related work in the area, in particular for issue 4). Although we are still far from solving these issues, the related work section should be extended to reflect this.

Only some links (as footnotes, not references) to related works have been included in the Related Work section, and they are basically limited to approaches that transform semi-structured data (CSV) into RDF, and to OpenRefine.

Several works in the context of open data and smart cities have been built to semantically lift and query heterogeneous data, in particular for publishing and consuming Linked Data while considering both fitness for use (for non-expert users) and return on investment (cost efficiency); see [1] to [10], to name a few. The authors should compare their work against these; otherwise, the proposed novelty and contributions are not clear.

Furthermore, the authors have ignored all the state of the art relating to natural interfaces to Linked Data (semantic search, question answering, faceted and exploratory systems, and so on). Similarly, the state of the art in analytics only lists the existing open-source frameworks; the authors do not compare themselves with any work that benefits from combining semantics and analytics.

Section 3 is nicely written. The authors clarify that their purpose is to select and improve upon existing tools to construct an integrated tool chain, focusing on non-expert users. The authors state that LinDA consists of a number of tools integrated in a workbench. However, while the remaining sections present the components with brief descriptions (somewhat as black boxes), the authors fail to include references to the tools that were actually used (besides Sparqlify), and the paper falls short of proving the claims stated before. Were these components developed for this project? Have they been evaluated?
I understand a system report should be brief and pointed, focusing on the capabilities rather than on a detailed description of each component, but without references to these components or a brief description of the algorithms, it is not clear how the system actually works.

That said, Sections 3 and 4 would benefit from a running example to showcase how the system works and what analytics it offers, e.g., what insights could a user discover using the current tool?

Section 4 presents the role of the components integrated in the workbench, namely: the transformation engine, the vocabulary and transformation metadata, the publication and consumption tools, the visualisation tools, the analytics and data mining, and the architecture properties of the LinDA workbench.

The authors state that semi-structured private and open data is transformed into semantically enriched data. Which data did you use? What was the size? How many datasets? How many triples did you obtain? What was the accuracy? How many mappings did the mapping file require? No quantitative details are given here, in the use cases, or anywhere else in the paper.

Components 4.1 and 4.2 (the transformation engine and the vocabulary repository) are the ones for which the most technical details are given; however, besides a description of their capabilities, I am missing a discussion of their limitations. For example, what if a row contains more than one entity (e.g., object properties)? How do you extract datatypes? A user-supervised method is used; what is the cost associated with this? Is this an approach fit for a non-expert user? Which vocabularies/catalogues did you use? For a large corpus I would expect usability issues, i.e., users having to disambiguate between a very large number of options in a drop-down menu.
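
To illustrate what I mean by datatype extraction: in CSV-to-RDF converters this is commonly a per-column heuristic over the lexical forms of the cell values. A minimal sketch in Python with rdflib follows, assuming one entity per row and a placeholder namespace; this is not the authors' actual algorithm, which the paper does not specify.

```python
# Hypothetical sketch of per-column datatype inference for CSV-to-RDF
# conversion. The namespace, file names, and the one-entity-per-row
# assumption are placeholders; LinDA's algorithm is unspecified.
import csv

from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")  # placeholder namespace

def infer_datatype(value: str):
    """Guess an XSD datatype from a cell's lexical form."""
    try:
        int(value)
        return XSD.integer
    except ValueError:
        pass
    try:
        float(value)
        return XSD.decimal
    except ValueError:
        return XSD.string

g = Graph()
with open("input.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        subject = EX[f"row/{i}"]  # one subject per CSV row -- exactly the
        g.add((subject, RDF.type, EX.Record))  # limitation raised above
        for column, value in row.items():
            predicate = EX[column.replace(" ", "_")]
            g.add((subject, predicate,
                   Literal(value, datatype=infer_datatype(value))))

g.serialize("output.ttl", format="turtle")
```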

“Suggesting a list of entities that may be suitable in the specific context” -> how? what is the context? do you need to create indexes to do this?
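
As a point of reference, suggestions of this kind are often implemented as lookups against a prebuilt label index. A toy Python sketch under that assumption follows; the vocabulary entries and the token-matching strategy are invented for illustration and say nothing about how LinDA implements this.

```python
# Toy sketch of index-based vocabulary term suggestion; the vocabulary
# entries and matching strategy are invented for illustration only.
from collections import defaultdict

VOCAB = {  # assumed URI -> label pairs
    "http://xmlns.com/foaf/0.1/Person": "person",
    "http://xmlns.com/foaf/0.1/name": "name",
    "http://purl.org/dc/terms/title": "title",
}

index = defaultdict(list)  # token -> candidate term URIs
for uri, label in VOCAB.items():
    for token in label.lower().split():
        index[token].append(uri)

def suggest(column_header: str) -> list[str]:
    """Suggest vocabulary terms whose labels match a column header."""
    return [uri
            for token in column_header.lower().split()
            for uri in index.get(token, [])]

print(suggest("Person name"))  # -> foaf:Person and foaf:name URIs
```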

“community ratings and other vocabulary statistics” -> such as?

Is the multi-linguality feature implemented, or is it a nice-to-have capability? If implemented, what tools and resources did you use?

“a visualisation of the RDF graph allow users to better perceive each vocabulary” -> who are the users? is this based on any evaluation? in my experience, non-expert users do not find graphs very usable…

Rather than a component, Section 4.3 describes a simple workflow for publication and consumption. It consists of a few steps: the conversion to RDF (explained in the previous section), a SPARQL query tool for expert users who can write SPARQL, and a query builder tool for non-expert users based on drop-down menus. This latter tool is not described here; while I understand this may be out of scope and there may not be a reference if the tool has not been published, some important information is missing, such as: what is the coverage of the types of queries this tool can answer? No evaluation results are provided for this tool in terms of performance (precision, recall), coverage, or, more importantly, usability for non-expert users.
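
For reference, menu-driven builders of this kind typically cover only conjunctive basic graph patterns, which is exactly the coverage question raised above. A hypothetical sketch of how drop-down selections might be compiled into SPARQL follows; the property URIs and the query shape are assumptions, since the tool is not described.

```python
# Hypothetical sketch of compiling drop-down selections into a SPARQL
# query; LinDA's Query Builder Tool is not described in the paper.
def build_query(cls: str, properties: list[str],
                filters: dict[str, str]) -> str:
    """Assemble a conjunctive SELECT query from menu selections."""
    patterns = [f"?s a <{cls}> ."]
    for prop in properties:
        patterns.append(f"?s <http://example.org/{prop}> ?{prop} .")
    for var, value in filters.items():
        patterns.append(f'FILTER (?{var} = "{value}")')
    body = "\n  ".join(patterns)
    var_list = " ".join(f"?{p}" for p in properties)
    return f"SELECT ?s {var_list}\nWHERE {{\n  {body}\n}}"

print(build_query("http://example.org/City",
                  ["name", "population"],
                  {"name": "Berlin"}))
```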

Section 4.4 refers to the visualisation tool, which is responsible for creating plots, charts, etc., and also recommends the most suitable visualisation. Here again, not enough technical details are given. Is this based on any existing tool (e.g., IBM Many Eyes)? You mention a visualisation ontology; how did you create it? Can you elaborate on this? Is this ontology publicly available anywhere?
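
For comparison, a visualisation recommender can be as simple as a rule table over the datatypes of the result columns. The sketch below illustrates that baseline; the rules are invented and are not LinDA's ontology-driven logic, which remains undescribed.

```python
# Illustrative rule-based chart recommendation over the XSD datatypes of
# a two-column result set; LinDA's ontology-driven logic is not described.
from rdflib.namespace import XSD

NUMERIC = {XSD.integer, XSD.decimal, XSD.double, XSD.float}
TEMPORAL = {XSD.date, XSD.dateTime, XSD.gYear}

def recommend_chart(x_type, y_type) -> str:
    """Map column datatypes to a default chart type."""
    if x_type in TEMPORAL and y_type in NUMERIC:
        return "line chart"    # numeric value over time
    if x_type in NUMERIC and y_type in NUMERIC:
        return "scatter plot"  # two numeric dimensions
    if y_type in NUMERIC:
        return "bar chart"     # categorical x, numeric y
    return "table"             # nothing chartable; fall back

print(recommend_chart(XSD.string, XSD.integer))  # -> bar chart
```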

Section 4.5 on analytics and data mining would strongly benefit from an example in which analytics are successfully used, e.g., to uncover a hidden pattern in the previously isolated data, as well as from an example of a specially designed custom workflow or of the preselected queries (how many? what is their coverage?). To me, the inclusion of analytics into this pipeline would be a great novelty of this approach. Unfortunately, not enough details are given to understand, or to be able to replicate, how this component works. Also, there is no discussion of the cost of properly configuring this component, or of which parameters need to be configured.
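
To make the request concrete: the kind of example I am missing would pull integrated data from a SPARQL endpoint and run an analytic over it, such as the pollution/disease correlation mentioned in the scenarios. A minimal Python sketch with SPARQLWrapper and pandas follows; the endpoint URL, predicates, and variable names are placeholders, not LinDA's API.

```python
# Placeholder sketch of an analytic over integrated Linked Data: fetch
# SPARQL results and compute a correlation. The endpoint and predicates
# are invented; this is not LinDA's documented API.
import pandas as pd
from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("http://localhost:3030/linda/sparql")  # assumed
sparql.setQuery("""
    SELECT ?region ?pollution ?diseaseRate WHERE {
      ?obs <http://example.org/region> ?region ;
           <http://example.org/pollutionLevel> ?pollution ;
           <http://example.org/diseaseRate> ?diseaseRate .
    }
""")
sparql.setReturnFormat(JSON)
bindings = sparql.query().convert()["results"]["bindings"]

df = pd.DataFrame({
    "pollution": [float(b["pollution"]["value"]) for b in bindings],
    "diseaseRate": [float(b["diseaseRate"]["value"]) for b in bindings],
})
print(df["pollution"].corr(df["diseaseRate"]))  # Pearson correlation
```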

Section 4.6 does not seem to add much to the paper. Section 5 goes again through the three steps of the workflow; this section seems redundant. I am not convinced of how the system can handle updates and deletes; do you have to re-index? How do you handle different users modifying the data in different ways?

The two scenarios presented, while complex and compelling, are not evaluated in any way; they read more as motivating, envisioned scenarios than as concrete pilot apps. There is no evaluation or usability study to substantiate the claims, so I am not convinced these use cases have actually been used to test an existing prototype. If they have, this section needs to be rewritten, even if only one use case is partially covered (instead of two), to make clear how the different workflows were used to solve each issue step by step. How were the workflows selected? What was the cost? What is the usability for non-experts compared with more familiar users? More details are needed regarding the accuracy of the selected resources, interlinking, analytics, etc. What about scalability and performance considerations? Could you find a correlation for the example questions, such as between pollution levels and diseases? If not yet, what are the (simpler) queries the system can help with for now? What further work is needed to answer such queries? What are the current capabilities and limitations? The scope of the current work is not clear after reading these scenarios.

Furthermore, without an evaluation, a user study, or at least a discussion based on experience and lessons learned, the impact on the ease of use and accessibility of these tools and of the pipeline for non-expert users can hardly be proven. The scenarios appear to provide real-world settings and problems; however, it is unclear whether the system can actually tackle them.

Summing up

The architecture's description is rather high-level. This may be expected given the length and scope of the paper, but the following limitations require a substantial revision:

- Has the approach been fully implemented? For a report on tools and systems, the guidelines strongly encourage that the system be accessible on the Web. Is that the case? If so, please provide a link to the system or a video demo; otherwise, some additional screenshots showing what the final system looks like would further strengthen the paper.

- While the topic of the paper is very relevant, and the general architecture of the proposed solution interesting, the authors provide too few technical details to make this paper scientifically interesting.

- Little awareness of the state of the art. It is also not clear what the major contribution of this paper is. The authors tackle very complex problems, with many subproblems that are not yet properly answered in the literature. Furthermore, a proper comparison of this approach against an extended related work section is needed.

- The claims are not supported. There is no evaluation setup or usability study; without one, it is not possible to say whether the proposed approach solves the issues of providing user friendliness while reducing costs.

- The conclusion and discussion would benefit from some key lessons learned and numbers: what were the main challenges when loading the data into the system? What about scalability issues? What is the size of the datasets used?

- There is potential impact and the scope of the work is very ambitious, but there is no convincing evidence of the maturity of the system. The capabilities are not proven, and the limitations are not described or discussed.

- The novelty of this approach, according to the authors, is to provide a way of integrating everything together in order to solve a real-world problem. However, it is not clear what effort was needed to adapt this system to the proposed use cases.

- There is a gap between the presented minimalistic workflow and the use-case descriptions. I cannot see how the workflow can be used to tackle the scenarios. The authors state in the abstract that the usability of this workflow will be demonstrated through concrete pilot apps, but this is not shown or proven in any part of the paper. The paper reads more like a visionary paper or a nice project description than a mature system description.

[1] Raimond, Y., Ferne, T.: The BBC World Service Archive Prototype. Semantic Web Challenge at ISWC 2013.
[2] Kotoulas, S., Lopez, V., Lloyd, R., Sbodio, M.L., Lecue, F., et al.: SPUD – Semantic Processing of Urban Data. Journal of Web Semantics, 2014.
[3] Lopez, V., Stephenson, M., Kotoulas, S., Tommasi, P.: Finding Mr and Mrs Entity in the City of Knowledge. Hypertext 2014.
[4] Lopez, V., Kotoulas, S., Sbodio, M., et al.: QuerioCity: A Linked Data Platform for Urban Information Management. ISWC 2012.
[5] Maali, F., Cyganiak, R., Peristeras, V.: A publishing pipeline for linked government data. ESWC 2012.
[6] Scharffe, F., Atemezing, G., Troncy, R., Gandon, F., et al.: Enabling linked-data publication with the Datalift platform. Workshop on Semantic Cities, AAAI 2012.
[7] Skjaeveland, M., Lian, E., Horrocks, I.: Publishing the Norwegian Petroleum Directorate’s FactPages as Semantic Web Data. ISWC 2013.
[8] Gossen, A., Haase, P., Hütter, C., Meier, M., Nikolov, A., Pinkel, C., Schmidt, M., Schwarte, A.: The Information Workbench – A Platform for Linked Data Applications. Submitted to the Semantic Web Journal: http://www.semantic-web-journal.net/content/information-workbench-platfo...
[9] Ding, L., Lebo, T., Erickson, J.S., DiFranzo, D., et al.: A portal for linked open government data ecosystems. Web Semantics 9(3), 2011.
[10] See also the Chicago data portal, where users can generate data from datasets and visualise it in different ways: https://data.cityofchicago.org/

Review #2
By Boris Villazon-Terrazas submitted on 18/May/2015
Suggestion:
Major Revision
Review Comment:

This paper presents a set of tools integrated within the LinDA framework. The authors claim the proposed framework targets SMEs. The technology presented covers data provision and consumption tools.

Some comments:
- I'm not sure we can consider all SMEs as non-expert users; surely there are some SMEs with a Linked Data and semantic technologies background.
- The authors refer to the "SME problem" throughout the paper; however, I cannot see this problem well stated within the paper. There are no facts described in the paper about it, and I think this is the core matter of the paper. What is the problem? Where is the pain?
- The related work section presents a list of Linked Data technologies. This list is neither exhaustive nor complete, and I think the authors should say so explicitly. We already have nine years of Linked Data, and surely there are more things to list.
- The related work section also describes a few projects dealing with Linked Data, but what about companies using or relying on Linked Data, such as the Semantic Web Company? I think these, i.e., all the Linked Data-related companies, should be included in the related work.
- What were the criteria for including a particular tool/technology within the LinDA framework? There should be some kind of evaluation process taking into account features like scalability, usability, etc.; otherwise it looks as if you only selected the tools of your own research lab.
- What integration methodology was followed for integrating the tools/technologies?
- I think you should include a dedicated modelling task in your framework, because modelling is not trivial. Yes, we can reuse available and existing vocabularies, but there are particularities in every case, and we therefore need to know how to extend and refine vocabularies.
- I think one key contribution of the paper is the analytics and data mining; however, within the paper this is only mentioned as a would-be feature ... there is no solid research or development in this area. I think this is one of the big problems of Linked Data: how to really consume and exploit it. Once we have integrated several datasets, how do we take advantage of that? How do we mine and extract knowledge from it?
- I was expecting some real-world application scenarios, not only technical descriptions of them. I think you should provide real use cases describing lessons learned and how you actually put Linked Data to work within SMEs.

Minor comments:
- the link http://sparqlify.org/wiki/SML is not working
- R and Pentaho each have a plugin for interacting with Linked Data; I don't recall the names right now
- I would like to see references/pointers to the RDF2Any API, the SPARQL Query Tool, and the Query Builder Tool
- I would like to see more information about your recommendation engine
- Webbased -> web based