Review Comment:
# Review of "A Linked Data Wrapper for CrunchBase"
**Summary: The dataset has high significance and proven third-party uses, but the paper itself has major issues, so I see it as borderline between reject and major revisions.**
## Quality and stability of the dataset
The quality of the interlinks to DBpedia is ensured using manual evaluation of random samples.
The quality of the data itself is supported by the conversion of an established base dataset and the use of a reasonable methodology.
URL: API URL given
Version date and number: no version date is given, but the paper distinguishes between a first and a second version
Licensing: given for the source data but exact license for the RDF dataset is missing (stated as “non-commercial purposes”)
Availability: complete RDF N-Triples dump, JSON-LD API, and N-Triples API for single resources. No SPARQL endpoint is given. An ontology with a VoID description is available as well.
## Usefulness of the dataset
The paper extensively discusses benefits and gives promising use cases for queries and integrations, such as for job search. Even better, the dataset is already used for financial data analysis in a peer-reviewed publication.
## Clarity and completeness of the descriptions.
### Innovation
> Nowack already provided an RDF wrapper for the CrunchBase API called Semantic CrunchBase in 2008, the service is no longer available.
A [blog entry](http://bnode.org/blog/2008/07/29/semantic-web-by-example-semantic-crunch...) for Semantic CrunchBase states “The initial RDF dataset is not using any known vocabs such as FOAF (or FOAFCorp). (We can INSERT mapping triples later, though.)”.
This is a major issue. According to the above statements, the existing approach could be extended with “mapping triples” to integrate vocabularies. A thorough motivation for why this existing approach was not extended, and a completely new approach was taken instead, is essential.
### Unproven Statements
Many claims made are vague and/or not accompanied by citations. Examples:
> “Is used by millions of users”
Provide exact numbers over some time period and provide a source.
> In contrast, many professional Crunch Base users may want to formulate more elaborate queries.
“May want to” is speculation; at least reformulate, or better yet, find a source.
> Having up-to-date answers to such questions can result in better market insights.
Instead of “can”, cite a source that says it does.
### Factual Errors
#### 5 Star Ranking
>Originally, the Crunchbase vocabulary would be a 1-star vocabulary according to Tim Berners-Lee’s star rating [...] Our CrunchBase RDF data set is a 5-star data set, as we provide our data set in RDF and link entity URIs (organizations) to DBpedia and our vocabulary URIs to other vocabularies.
I think you are confusing the 5-star deployment scheme for Open Data with the 5 stars of Linked Data Vocabulary Use as requested to discuss by the SWJ.
The statement that the original data has 1 star is incorrect: In the Open Data rating it would get at least 3 because it is “available in a non-proprietary open format”, i.e. JSON. I don’t have an API key but according to the CrunchBase docs, it seems like the REST interface uses “URIs to denote things, so that people can point at your stuff”, so it would get 4 stars here. If you mean the Linked Data Vocabulary Use stars, it wouldn’t even get 0 stars because it is not Linked Data.
The resulting RDF data, on the other hand, would get 4 stars because it contains links to DBpedia, but not 5, because there are no backlinks from other knowledge bases, as far as described. If there are backlinks, e.g. from Lee et al., state that clearly; the dataset would then have 5 stars.
#### JSON to RDF
> “five out of all 38 papers mention JSON as input or output data format, but only the description of the Facebook RDF Wrapper [8] describes a conversion of JSON to RDF”
JSON should always be an input format and never an output format here, as RDF is necessarily the output of a method that produces an RDF dataset. Also, the claim is false, as “LinkedSpending: OpenSpending becomes Linked Open Data” is a Semantic Web Journal paper that transforms JSON to RDF.
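To make the direction of the conversion concrete, here is a minimal sketch of a JSON-to-RDF step that reads a flat JSON object and emits N-Triples. The namespace URIs, field names, and function names are hypothetical and chosen for illustration only; they are not taken from the paper, from CrunchBase, or from LinkedSpending.

```python
import json

# Hypothetical namespaces, for illustration only.
BASE = "http://example.org/crunchbase/"
VOCAB = "http://example.org/vocab#"


def escape_literal(value: str) -> str:
    """Escape a string for use inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def json_to_ntriples(json_text: str) -> list[str]:
    """Turn one flat JSON object (input) into N-Triples lines (output)."""
    record = json.loads(json_text)
    subject = f"<{BASE}{record['id']}>"
    triples = []
    for key, value in record.items():
        # The identifier becomes the subject URI; null values are skipped.
        if key == "id" or value is None:
            continue
        predicate = f"<{VOCAB}{key}>"
        triples.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    return triples


# Hypothetical input record.
example = '{"id": "acme", "name": "Acme Inc.", "founded_on": "2010-05-01"}'
ntriples = json_to_ntriples(example)
print("\n".join(ntriples))
```

In this sketch JSON appears only on the input side and RDF only on the output side, which is the distinction the related-work comparison should make.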
### Formal Criteria and Writing
I did not find any errors in grammar, formatting, or spelling.
The writing is sloppy at times, though, with phrases such as “the topic of CrunchBase is a bit special”.
Many sentences are unnecessarily verbose; fixing this could help to meet the 10-page limit, which is exceeded by half a page right now. For example, consider the following passages:
> CrunchBase was founded in 2007 by Mike Arrington, the founder of the TechCrunch weblog, to track data about startups covered in posts. Nowadays, CrunchBase is used by millions of users to track the fast-changing world of startups.
Reads like an advertisement. Not necessary to know who the founder is. Compress to one sentence.
> According to the authors, the reported work has been well-accepted at several public events and conferences such as the 26th XBRL conference.
Unnecessary. The reference already tells the reader that it is a peer-reviewed publication.
> “For this information, we queried our CrunchBase RDF data which we retrieved via our Linked Data API. See Section 3 for more information”
This is clear.
## Other questions and comments
>“Should we invest in startup X?”
If users edit CrunchBase themselves, how is abuse prevented, such as misrepresentation of one's own company?
>Such a query, formulated in natural language, might be: “Which companies existing at most 5 years have been acquired for more than 1 bn USD?”
You could thus add Semantic Question Answering to future work.
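As a rough illustration of what a question answering system would have to derive from that sentence, the question decomposes into two structured constraints: company age at acquisition of at most 5 years, and an acquisition price above 1 bn USD. The sketch below applies them to in-memory records; all record fields, names, and values are hypothetical and do not reflect the CrunchBase schema or the authors' vocabulary.

```python
from datetime import date

# Hypothetical records; field names and values are assumptions.
companies = [
    {"name": "A", "founded_on": date(2010, 1, 1),
     "acquired_on": date(2013, 6, 1), "price_usd": 2_000_000_000},
    {"name": "B", "founded_on": date(2000, 1, 1),
     "acquired_on": date(2012, 1, 1), "price_usd": 5_000_000_000},
    {"name": "C", "founded_on": date(2011, 3, 1),
     "acquired_on": date(2014, 3, 1), "price_usd": 300_000_000},
]


def within_years(start: date, end: date, max_years: int) -> bool:
    """True if `end` lies at most `max_years` after `start`."""
    # Tuple comparison avoids constructing an invalid date such as Feb 29.
    return (end.year, end.month, end.day) <= (
        start.year + max_years, start.month, start.day)


def young_big_acquisitions(records, max_years=5, min_price_usd=1_000_000_000):
    """Names of companies acquired within `max_years` for more than `min_price_usd`."""
    return [
        r["name"]
        for r in records
        if within_years(r["founded_on"], r["acquired_on"], max_years)
        and r["price_usd"] > min_price_usd
    ]


result = young_big_acquisitions(companies)
print(result)
```

With the sample records above, only company "A" satisfies both constraints.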
> we have implemented a Linked Data API as wrapper around the publicly available CrunchBase REST API;
> the official CrunchBase API is only accessible with an API key
Please clarify: is it freely available, or do you need an API key, potentially with costs?
If it is the latter, are there legal problems with openly publishing it as RDF? Judging from the website, it seems like organizations and people are freely available while product information costs money. You state that it is available for non-commercial purposes; does that include products, etc., that are only available from the API with a key?
>This confidence value is encoded in binary format [...]
I would suggest “is encoded as a bit array” or “bitset”.