Review Comment:
COLINDA is a new addition to the LOD cloud, containing geo-referenced data about scientific conferences and workshops from 2007 to 2011, based on calls for papers sourced from WikiCfP and Eventseer.
The authors describe the data sources and the conversion process, and conclude with an example of usage. A key limitation identified is incomplete or missing geolocation information for a substantial part of the dataset.
I have one key concern: the population of the dataset. The authors note that publication approval from Eventseer is still pending; is there a fallback option if this is not obtained? They also note that "parts of WikiCfP data are still in processing stage"; where and what is the bottleneck in processing, and does it lie with WikiCfP or COLINDA?
The COLINDA website provides a map that can be used only to browse conference acronyms and locations (city, country). Linking each point to the complete RDF entry would be useful. Even using the acronyms on the map, I was unable to retrieve information for any conference (I tried a few before giving up) from the forms and endpoint on the website, or using the URIs as described in section 2.5, other than the example provided on the website. The data dump from datahub does work, however; I noticed that the URIs differ between the paper, the COLINDA website and datahub.
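For reference, a query of roughly the following shape should be answerable by the endpoint if the URIs are dereferenceable and consistent. This is only a sketch: the class and predicate choices (swrc:Conference, rdfs:label) are my assumptions based on the model described in the paper, not confirmed against the actual dataset.

```sparql
# Hypothetical sanity check against the COLINDA SPARQL endpoint:
# list a few conference resources together with their acronym/label.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX swrc: <http://swrc.ontoware.org/ontology#>

SELECT ?conference ?label
WHERE {
  ?conference a swrc:Conference ;
              rdfs:label ?label .
}
LIMIT 10
```

If such a query succeeds against the endpoint but the returned URIs do not match those in the paper or on datahub, that would confirm the inconsistency noted above.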
Røst et al. have a paper under review on the SWJ website, submitted for the first call for linked dataset descriptions: 'Eventseer: "Calls for Papers" as Linked Data'
http://www.semantic-web-journal.net/content/eventseer-calls-papers-linke...
While I acknowledge this is not yet published, the current version of the submission and the data are publicly visible and very relevant to this work. What does COLINDA provide over other similar existing services, and especially the Eventseer linked dataset (seeing as it makes use of this data)?
Some information IS given later in the paper, about connections to Linked Science and Research 2.0; if these are key, they should be mentioned in the introduction. A brief summary of the requirements of these two fields that COLINDA satisfies would be useful.
************ SWJ Linked Dataset description requirements ************
* Name, URL, version date and number, licensing, availability, etc.
Licensing information for the source data is provided; however, the license under which COLINDA itself will be provided is not stated, nor are date or version information.
* Topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.
The authors state that the initial intention of COLINDA was to provide a "tag based identification system for scientific events". Did this change? If so, to what? If not, consider "primary" (assuming this was what was meant) rather than "initial". Also, what exactly does "tag based identification" mean? Is this related to the (implied definition of a) hashtag in S2.2? That definition is incorrect, as the examples given do not have a '#' prefix; they are terms, concepts or acronyms. Further, the corresponding example for DBLP is a path. I would reword this to say, e.g., acronym/term/concept, which is easily extended to a hashtag, OR provide an unambiguous definition of "tag" here.
The information extraction & data processing carried out is described, based on the ontology model followed.
The dataset is currently incomplete: by the authors' own estimation, approximately 1/6 of their target dataset has been processed and is currently available. However, which sections are complete/available is not clear. In S3.6: "As it can be seen in table 3 currently most conferences date from 2008 and 2009 since those dumps from WIkiCfP has been imported completely yet." "Yet" implies there is still data to be imported for those years, whereas the beginning of the sentence implies that all years but these two are yet to be imported.
Also, are there plans to update COLINDA? The end date reported is 2011, yet this submission is from mid 2013. Especially considering that the use case mentions Twitter as a source of affinity data, 2011 is woefully out of date.
S3.3 - are the processes to retrieve a single instance via REST and the full dataset via the SPARQL endpoint independent? This appears to be the case; if so, isn't it inefficient? Why is the data not simply stored in the RDF store only? Also, the data processing involves a number of steps: first a conversion to CSV, then a dump into a MySQL database, and finally the conversion to RDF on demand. Why so many intermediate steps?
* Metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth.
Statistics are provided only at a high level, in independent tables listing conference counts by year and country. See the point above on statistics, and below on internal connectivity. The ontologies reused are mentioned, with a brief discussion of relevance/applicability. However, information on the coverage of concepts other than year and country/location, and of the relations in COLINDA's model, is missing.
* Examples and critical discussion of typical knowledge modeling patterns used.
* Known shortcomings of the dataset.
The model used is described; there is, however, no comparison with previous work or recommended practice. Further, the discussion is in some parts difficult to follow, chiefly the extraction of location information.
In S3.4, owl:sameAs and swrc:location are two different things; why do the authors see them as equivalent? In Fig. 1, swrc:location points from the rdf:type node for the conference to a location, which makes sense; owl:sameAs here would not make sense.
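To make the distinction concrete, here is a minimal sketch; the resource URIs are hypothetical, invented only for illustration, and do not claim to match COLINDA's actual URI scheme.

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix swrc: <http://swrc.ontoware.org/ontology#> .

# swrc:location relates an event to the place where it is held:
<http://example.org/conference/ISWC2011>
    a swrc:Conference ;
    swrc:location <http://example.org/location/Bonn> .

# owl:sameAs, by contrast, asserts that two URIs denote the SAME
# individual, e.g. linking a location to its DBpedia counterpart:
<http://example.org/location/Bonn>
    owl:sameAs <http://dbpedia.org/resource/Bonn> .
```

Using owl:sameAs between a conference and a location would (incorrectly) assert that the event and the place are the same individual, which is why the two properties cannot be treated as equivalent.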
The discussion about the dissemination of locations is confusing. What is the "top 5 conference count"? And what is its proportion of the complete dataset? The point made about it being "...only 1/6 part of the whole locations contained" is not very meaningful without this. Fig. 2, on the other hand, shows that the majority of instances contain location information.
* Usage
The potential of this dataset to provide data for mashups is proposed.
The authors give one example of use, the Researcher Affinity Browser; however, it isn't obvious what contribution COLINDA makes to the browser. Also, is S4.1 the authors' extension to this browser, or simply an explanation of how it works? If the latter, it would be more useful if it clearly illustrated where the browser works with COLINDA.
************ Other points ************
Fig. 1 is a bit confusing: how is rdf:type a node, and a central one at that? Should "id" not rather be followed by the conference ID?
WikiCfP and Eventseer are NOT web "pages" but sites.
"Such pages can be considered as scientific event announcement pages editable by the users with archiving character. " - this sentence is confusing - too many things are being said in one sentence. Assuming "users" are people who submit new events, what do they edit? Can they continue to edit post submission? How is "archiving character" relevant here - who/what is archiving - the users or the sites?
"Eventseer contains according the latest infromation4 information about around 21000 events ..." - would suggest changing the first "information" to data or report.
"Scientific events from both pages date from 2002 up to now ..." - implies it extends to the time of reading, rather than the point when the paper was written. Although even the latter would be incorrect - the paper states data ranges from 2007-2011.
"Listing 1 shows a simple entry from an WikiCfP data dump that was used to create instances from COLINDA ..." - do you mean FOR rather than FROM?
"Mesh Ups" - used several times - should this be "mashups"?
It would be useful to show the results of running the query in listing 3.
"... possible appliance case ..." - "appliance" is incorrect; this should be "possible APPLICATION" or "possible (use) case".
Fig. 3 is too small for the reader to follow the text description of use - it is only legible at very high magnification on-screen.
"... special affinity criteria ..." - does this mean a way of measuring similarity between researchers? "Affinity" used in this way is unusual and simply makes the reader's job harder. Even simply showing examples of the affinities in the snapshot would be helpful; this part of the snapshot is hidden.
How is the video of the browser relevant to COLINDA?
Citations & References
Wrt Berners-Lee's "5-star" scheme: the article itself should really be cited, with the link to the diagram added as further detail.
Presentation, spelling & grammar
The paper is difficult to follow, mainly because of the presentation. An automatic spelling and grammar check IS needed, but more importantly, a proofread. The issue is not that the authors may not have English as a primary language, but rather that the paper appears not to have been read through to pick up basic errors and ensure readability.
I would also suggest the authors read the criteria for this paper type and ensure they have answered each of the key requirements. I don't doubt that this has been done, but rather that the information is not provided in a way that lets the reader easily understand it. Papers from the first call should also provide useful pointers for this.
Comments
Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-call-2nd-s...