Review Comment:
This paper describes a practical workflow for transforming SDMX collections into Linked Data. The authors focus on four relevant statistical datasets:
- OECD, whose mission is to promote policies that will improve the economic and social well-being of people around the world.
- BFS Swiss Statistics, whose Federal Statistical Office web portal offers a wide range of statistical information including population, health, economy, employment and education.
- FAO, which works on achieving food security for all, making sure people have regular access to enough high-quality food.
- ECB, whose main task is to maintain the euro's purchasing power and thus price stability in the euro area.
Nevertheless, the tool proposed in the paper can be easily used for transforming any other SDMX dataset into Linked Data.
On the one hand, statistical data are rich sources of knowledge that are currently underexploited. Any new approach is welcome, and this paper describes an effective workflow for transforming SDMX collections into Linked Data. On the other hand, the approach is technically sound. It describes a simple but effective solution based on well-known tools, guaranteeing robustness and making its final integration into existing environments easy. Thus, the workflow is a contribution in itself, and each stage describes how it impacts the final dataset configuration.
With respect to the obtained datasets, these are clearly described in Sections 7 and 8. They reuse well-known vocabularies and provide interesting interlinkage among themselves, as well as with DBpedia, World Bank, Transparency International and EUNIS. Apache Jena TDB is used to load the RDF and Apache Jena Fuseki is used to run the SPARQL endpoint. The datasets are also released as RDF dumps (referenced from the Data Hub).
Finally, it is relevant for me how scalability problems are addressed, because I think that 12 GB is an excessive amount of memory for processing these datasets (the largest one outputs fewer than 250 million triples). Do you have an alternative for processing larger datasets? Perhaps you could partition the original dataset into fragments: is the tool flexible enough to support this? Please explain how scalability issues will be addressed to guarantee that large datasets can be effectively transformed.
Comments
Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-call-2nd-s...
Response to reviewers
We would like to thank all reviewers for their time and valuable feedback. It is much appreciated.
I've tried to address some of your questions and comments below, and have made changes for the camera-ready version.
Problems with the original SDMX data are explained in terms of what the Linked SDMX templates do to handle or work around the shortcomings, as opposed to correcting them, which would be more appropriate to take care of at the source. For instance, the configuration option "omitComponents" allows the administrator to skip any cube components in the datasets, e.g., malformed codes or erroneous data that they are able to identify, or simply to leave out a component for their own reasons.
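To illustrate the skipping behaviour, here is a minimal Python sketch; the actual tool is XSLT-driven, so the data structures and component names below are hypothetical, and only the "omitComponents" option name comes from the configuration.

    # Hypothetical sketch: drop configured components before generating RDF.
    # The real transformation happens in XSLT; this only illustrates the idea.
    omit_components = {"OBS_STATUS", "DECIMALS"}  # e.g., values read from the config

    def filter_components(observation: dict) -> dict:
        """Return a copy of the observation without the omitted components."""
        return {k: v for k, v in observation.items() if k not in omit_components}

    obs = {"REF_AREA": "CHE", "TIME_PERIOD": "2010", "OBS_STATUS": "E", "OBS_VALUE": "42"}
    print(filter_components(obs))  # {'REF_AREA': 'CHE', 'TIME_PERIOD': '2010', 'OBS_VALUE': '42'}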
Some normalization is done, e.g., whitespace is removed from code values when they are used in URIs so that the URIs are safe, and reference period values are converted to British reference period URIs.
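A minimal Python sketch of these two normalization steps, assuming a reference.data.gov.uk-style interval URI pattern (the exact pattern used by the templates may differ):

    import re

    # Hypothetical sketch; the real normalization is done inside the XSLT templates.
    def safe_code(value: str) -> str:
        """Strip whitespace from a code value so it can be used in a URI."""
        return re.sub(r"\s+", "", value.strip())

    def year_to_interval_uri(year: str) -> str:
        """Map a yearly reference period to an (assumed) British reference period URI."""
        return f"http://reference.data.gov.uk/id/gregorian-year/{year}"

    print(safe_code(" A 1 "))            # "A1"
    print(year_to_interval_uri("2010"))  # ".../gregorian-year/2010"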
Missing values, e.g., a human-readable dataset title, are left as such, with a fallback to the dataset code. If license information is not provided, it can be explicitly set in the configuration file.
In order to use agency identifiers consistently, the configuration provides a way to add aliases for the primary agency identifier.
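A small Python sketch of the fallback and alias behaviour described in the previous two paragraphs; the names and structures are illustrative, not the tool's actual API:

    # Hypothetical sketch of the title fallback and agency alias handling.
    agency_aliases = {"OECD.STAT": "OECD"}  # example alias mapping from a config

    def dataset_title(metadata: dict, dataset_code: str) -> str:
        """Use the human-readable title if present, otherwise fall back to the code."""
        return metadata.get("title") or dataset_code

    def canonical_agency(agency_id: str) -> str:
        """Resolve an alias to the primary agency identifier."""
        return agency_aliases.get(agency_id, agency_id)

    print(dataset_title({}, "QNA"))       # "QNA"
    print(canonical_agency("OECD.STAT"))  # "OECD"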
In SDMX 2.0, extensive descriptions for code lists are not provided. All of the datasets that were transformed used SDMX 2.0. As far as I know, SDMX 2.1 resolves this shortcoming, and some of the SDMX publishers are adopting 2.1.
The order of dimension values (e.g., as a path) in the observation URI is based on the order in the dataset, i.e., it does not reflect the DSD's order. While the DSD makes the call on the order of the dimensions (qb:order is specified from the DSD in any case), the order of the terms in the URI is not too important. I do, however, think that it may be reasonable to reorder based on what the DSD says (overriding the order in the dataset); I will revisit this for the templates.
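A short Python sketch contrasting the two orderings; the base URI and dimension identifiers are made up for illustration:

    # Hypothetical sketch of building an observation URI from dimension values.
    BASE = "http://example.org/dataset/QNA/"

    dsd_order = ["FREQ", "REF_AREA", "SUBJECT", "TIME_PERIOD"]   # order given by the DSD
    dataset_values = [("REF_AREA", "CHE"), ("FREQ", "Q"),        # order as found in the data
                      ("SUBJECT", "B1_GE"), ("TIME_PERIOD", "2010-Q1")]

    def observation_uri(values, order=None):
        """Join dimension values into a URI path, optionally reordered by the DSD."""
        if order is not None:
            values = sorted(values, key=lambda kv: order.index(kv[0]))
        return BASE + "/".join(v for _, v in values)

    print(observation_uri(dataset_values))             # dataset order (current behaviour)
    print(observation_uri(dataset_values, dsd_order))  # reordered per the DSD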
For code list classes, only the codelistID is used, simply to follow the same pattern as in sdmx-code.ttl, i.e., a code list has a seeAlso to a class (which is of the same nature, i.e., referring to the code list as opposed to the code). But yes, the code list could be a super-class of the class that is used for the code.
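For reference, a minimal rdflib sketch of the seeAlso pattern mentioned above; the URIs are placeholders, not the actual published ones:

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF, RDFS, SKOS, OWL

    # Placeholder URIs; only the codelist/class linking pattern is the point here.
    EX = Namespace("http://example.org/code/")
    g = Graph()

    codelist = EX["CL_FREQ"]   # the code list (a skos:ConceptScheme)
    code_class = EX["Freq"]    # the class used for the codes

    g.add((codelist, RDF.type, SKOS.ConceptScheme))
    g.add((code_class, RDF.type, OWL.Class))
    g.add((code_class, RDFS.subClassOf, SKOS.Concept))
    g.add((codelist, RDFS.seeAlso, code_class))   # same pattern as in sdmx-code.ttl

    print(g.serialize(format="turtle"))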
I omitted the annotations example because it is somewhat extensive for the paper (it is unfortunate that there are artificial length limits set for "papers" meant to share knowledge). If you are interested, there is an example at https://github.com/csarven/linked-sdmx/wiki#interlinking-sdmx-annotations.
The interlinks that are retained at the end have passed through a human review process. They are correct in the sense that a concept like a reference area (e.g., a country) from two different datasets uses the same code notation and label. In other words, temporal aspects are ignored, not to mention that they are not available in the SDMX sources (as far as I can tell). This of course means that they may still be wrong, since the country concept used in source A may be a "former" country mentioned in source B. While such interlinking is critical, it requires better interlinking methods and tools.
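A toy Python sketch of the notation-and-label matching described above; the URIs are invented, and in practice the candidate links still go through human review:

    # Hypothetical sketch: propose links between codes that share notation and label.
    source_a = {("CH", "Switzerland"): "http://example.org/oecd/code/CH"}
    source_b = {("CH", "Switzerland"): "http://example.org/ecb/code/CH",
                ("CS", "Czechoslovakia"): "http://example.org/ecb/code/CS"}

    candidate_links = [(a_uri, b_uri)
                       for a_key, a_uri in source_a.items()
                       for b_key, b_uri in source_b.items()
                       if a_key == b_key]   # same notation and same label

    print(candidate_links)  # pairs to be reviewed by a human before publishing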
The current state of the paper is 8 pages (which is within the call's requirement of 5-8 pages).
The amount of memory (12 GB) that was used for the transformation was merely what was dedicated to the process itself. Based on tests, the minimum amount of memory that was actually required was 4 GB.
The triple counts of the datasets reflect the total number of triples after all the transformations for each dataset. Basically, multiple SDMX-ML files are transformed from each source. The largest single SDMX-ML file was in fact 6.6 GB, plenty for the Linked SDMX XSLT; http://csarven.ca/linked-sdmx-data provides a bit more information on the data retrieval process.
Thank you all again.