Increasing the Financial Transparency of European Commission Project Funding

Paper Title: 
Increasing the Financial Transparency of European Commission Project Funding
Authors: 
Michael Martin, Claus Stadler, Philipp Frischmuth, Jens Lehmann
Abstract: 
The Financial Transparency System (FTS) of the European Commission contains information about grants for European Union projects starting from 2007. It allows users to get an overview on EU funding, including information on beneficiaries as well as the amount and type of expenditure and information on the responsible EU department. The original dataset is freely available on the European Commission website, where users can query the data using an HTML form and download it in CSV and most recently XML format. In this article, we describe the transformation of this data to RDF and its interlinking with other datasets. We show that this allows interesting queries over the data, which were very difficult without this conversion. The main benefit of the dataset is a further increased financial transparency of EU project funding. The RDF version of the FTS dataset will become part of the EU Open Data Portal and eventually be hosted and maintained by the European Union itself.
Submission type: 
Dataset Description
Responsible editor: 
Pascal Hitzler
Decision/Status: 
Major Revision
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Resubmission after a "reject and resubmit", now conditionally accepted with major revisions. First round reviews are beneath the second round reviews.

Solicited review by Aidan Hogan:

Thanks to the authors for their revision and comments. The paper now reads very well. Most importantly, issues relating to dereferenceability now seem to be resolved.

I am recommending an accept, but I've jotted down some minor details I'd like the authors to have a quick look at:

Editorial:
Section 1
* "ensures a long time availability" -> "ensures a long term availability"
Section 2
* "and - in addition - the" [Don't use hyphens here. Use a dash, like an en-dash or em-dash, which are longer.]
* "...the RDF literals" I understand what the authors mean, but RDF literals is not a good term to use for URIs.
* Figure 1 is very nice, but I recommend lightening the fill shades for better contrast when printed black and white. Also would prefer if the prefixes corresponded with Listing 1, etc.
* "The resource is [an] instance"

Section 3:
* First paragraph of Section 3.1 is poorly written. Have another look.
* Virtuoso only has partial support for SPARQL 1.1 whereas other engines have full SPARQL 1.1 support. I would rephrase/remove that justification.
* "27.000", "48.860" Use commas (or thin spaces) for separating thousands. Be consistent (including tables).

Section 4:
* Your query times out when I try it. It looks expensive. Again, I wonder if the sub-queries are really necessary (instead use GROUP BY and avoid the second same-term comparison in the FILTER to make more efficient?).

Some minor comments with respect to the dataset (some are subjective comments just to think about):
* Linking years to DBpedia seems a bit like you're linking for the sake of it. I would prefer to use a direct numerical value, possibly even an xsd:gYear value (whose value can be directly and unambiguously interpreted as a specific year).
* I'm not sure about mixing datatype and object values for co-financing. I think it would be better to make the model more consistent in this respect (e.g., for ease of querying).
* Why amounts in 1,000's of euro? This would only complicate queries involving other datasets (e.g., divide by GDP).
* Align fts-o:subject with dcterms:subject?
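
The suggestions about years and amounts can be illustrated with a small sketch. It is not the authors' conversion code: the helper names `year_literal` and `euro_amount_literal` are hypothetical, and it assumes the source amount is an integer count of thousands of euro.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"


def year_literal(year: int) -> str:
    """Render a year as an xsd:gYear typed literal (N-Triples style)."""
    if not 0 < year <= 9999:
        raise ValueError("this sketch only handles four-digit years")
    return f'"{year:04d}"^^<{XSD}gYear>'


def euro_amount_literal(amount_in_thousands: int) -> str:
    """Convert an amount in thousands of euro (assumed input unit) to plain euro."""
    euros = Decimal(amount_in_thousands) * 1000
    return f'"{euros}"^^<{XSD}decimal>'


print(year_literal(2007))
print(euro_amount_literal(250))
```

Storing plain euro values this way would let a query divide funding by figures from other datasets without first rescaling by 1,000.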

Minor quibbles aside, the authors have done a good job!

Solicited review by Emanuele Della Valle:

All comments in my previous review but the following have been addressed. Please check
page 2, column 2: the IRI "fts-o:cofinancingRate" runs into the margin. I would rewrite the sentence as "If projects are co-financed, the respective information is encoded by using the fts-o:cofinancingRate property."

Solicited review by Jerome Euzenat:

The presented data set concerns one primary source: EC-funded projects
as published by the commission. This seems to be a good source because it contains data not published elsewhere. Moreover, the EC website is not easy to manipulate.

The authors did a good job in addressing previous comments both on the
paper and the data set. The fts vocabulary drawing (Fig 1) is great
even in greyscale. In general, I think that the readability of the
paper has largely improved by focussing on what is offered, but I still
stand by my statement that "the paper is more 'usable' than the
current way the dataset is made available."

Indeed, dereferencing the ontology (which according to my previous review worked last time) provides the HTML message:
"The resource http://fts.publicdata.eu/ontology/ you are trying to reach does not exist."

I would say that the wiki is still not my favorite way to access the
data set. The system is not particularly intuitive and is very slow.
It is quite common to get an error
(asking to answer on 100 beneficiaries got me an "Internal Server
Error").
However, it is now browsable and usually up.

Concerning the specific questions raised by the call:
* Quality of the dataset
This seems to be a high quality dataset, properly curated and linked.

* Usefulness (or potential usefulness) of the dataset
Potential usefulness is quite high. However, the current
implementation is lagging behind expectations. This may be because the
dataset should be migrated somewhere else?

* Clarity and completeness of the descriptions
The paper is clear.

As a detail, I would have separated the Related work in two parts:
- the first 4 paragraphs are rather Lessons learnt (they are not really
Related work)
- the remaining paragraph could be Related resources

On p2 about the encoding of percentages: why a string instead of a
float between 0 and 1? or between 0 and 100?
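
A minimal sketch of the numeric encoding suggested here, assuming the source value is a string such as "50%" or "12.5 %" (the actual string format in the dataset is not shown in this review, and `parse_cofinancing_rate` is a hypothetical helper name):

```python
def parse_cofinancing_rate(raw: str) -> float:
    """Turn a percentage string into a fraction between 0 and 1."""
    # Drop surrounding whitespace and a trailing percent sign; accept a
    # decimal comma as well as a decimal point (assumed input variants).
    text = raw.strip().rstrip("%").strip().replace(",", ".")
    value = float(text)
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"percentage out of range: {raw!r}")
    return value / 100.0


print(parse_cofinancing_rate("50%"))
```

A numeric value of this kind can be compared, averaged, or filtered directly in a query, which a string cannot.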

On p4: it is mentioned that the XSD has been modified. Has this
been reported back to the data source?

Details:
- p3 : I think that the caption of Listing 1 should be FTS resource
instead of dbpedia (?)
- p4 : mid part -> lower part
- p5 : RDF data a -> RDF data
as SPARQL endpoint -> as a SPARQL endpoint
e.g. -> e.g.,
city country names -> city and country names
Ref 9 uk -> UK ({UK} in bibtex)

First round reviews:

Solicited review by Aidan Hogan:

This paper describes ongoing efforts to RDFise and publish information about EU project grants (since 2007) as Linked Data. The information relates to budget information including amounts awarded, year(s) of award, the beneficiaries involved, their country, etc. The ultimate goal is to have the EU host the data themselves as part of the "EU Open Data Portal". The information is currently available on the EC website, browsable as HTML or downloadable in CSV/XML. The paper provides an overview of the raw data, details of how the data are triplified, including a vocabulary to model the data and workflow for transforming the data; how the data are exposed on the Web; how links are generated to LinkedGeoData cities/countries; and provides an example SPARQL query to show-case why having RDF is useful, demonstrating a query that takes GDP information from DBpedia and combines it with local data on EU project funding for various countries. Some related work is also presented. Without links, the final dataset consists of over three million RDF triples. Links number in the low thousands.

The work is obviously of significance, particularly if it influences/guides the EU in terms of Linked Data exports from the EU Open Data Portal. The data should be of interest to have as part of, e.g., the Linked Open Data cloud. The description itself is quite comprehensive. The English should be improved a bit (some typos, badly constructed sentences and misspellings are present), but this is not a significant problem. In general, the paper does a good job in motivating publishing the dataset as RDF, and I like the inclusion of the SPARQL query in Section 4 to make the benefits more concrete.

The description does have some significant shortcomings however, primarily to do with Section 2.2.

* I'm happy to see the model/vocabulary depicted in Figure 1. However, it's too compressed to be easily read; it's difficult to get the overall picture here. One option is to take another page and expand this diagram (or shorten Related Work to make space). Also, perhaps the diagram itself can be reformatted to be more like a UML diagram (e.g., include the datatype-properties in the class boxes; use the object-properties to connect the boxes, or such like). In any case, the diagram should be improved.

* Perhaps as a side-effect of Figure 1 being difficult to read, the text of Section 2.2 is difficult to follow. I think it would be better to frame the discussion around a concrete example. For instance, the paper states: "The core element of the schema is the class fts-o:Commitment, which is used to type RDF resources describing project funding details." To me, this is a bit ambiguous. Having a concrete example of instance data to refer to would help significantly, and could maybe allow the text to be shortened.

* There is a similar problem in Section 2.3. For example, I could not follow the paragraph "The beneficiaries that participate alone in a commitment...", nor the paragraph afterwards. Without the context of what is in the raw data, such statements are difficult to interpret.

Presentation aside, an important aspect of the evaluation for this Special Issue is the dataset itself. Much like the other Martin et al. paper I reviewed, I found some worrying shortcomings when looking at the data online. First and foremost, I could not successfully dereference any Linked Data URIs, nor could I dereference the ontology. I tried:

http://fts.publicdata.eu/resource/ci/MILANO-Italy
http://fts.publicdata.eu/ontology/
http://fts.publicdata.eu/cm/CCR.IHCP.B433923.1

And got errors similar to:

"""
OntoWiki Error

Zend_Controller_Action_Exception: Action "ci" does not exist and was not trapped in __call()
/var/www/fts.publicdata.eu/libraries/Zend/Controller/Action.php@485 (404)
"""

Also, although the landing page clearly has navigation over RDF data, it has various dead links (including to the SPARQL endpoint) and is often quite unresponsive. I did find the SNORQL endpoint linked from the paper, and it does seem to have the bulk of data loaded. However, I tried the query in Listing 1 (as suggested by the authors in the paper) but it did not work: the endpoint does not contain any owl:sameAs links, hence the query returns no results.

I grant that these issues could (probably) be easily fixed, but they do not inspire confidence and practical aspects are crucial to the evaluation of submissions to this Special Issue. Although the work shows potential, these latter issues with the dataset -- particularly the lack of working dereferenceable URIs -- are the main reason why I cannot recommend an accept at this time.
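
The kind of dereferenceability check described above could be scripted roughly as follows. This is a sketch using only the Python standard library; the function names and the choice of accepted media types are assumptions, not part of the review or of the authors' setup.

```python
import urllib.error
import urllib.request
from typing import Tuple

# Media types requested via content negotiation (an assumed, common choice).
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def build_rdf_request(uri: str) -> urllib.request.Request:
    """Build a GET request that asks for an RDF serialisation of the resource."""
    return urllib.request.Request(uri, headers={"Accept": RDF_ACCEPT})


def check_dereferenceable(uri: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Return (ok, detail); ok means HTTP success with an RDF content type."""
    try:
        with urllib.request.urlopen(build_rdf_request(uri), timeout=timeout) as resp:
            content_type = resp.headers.get("Content-Type", "")
            is_rdf = any(t in content_type for t in ("turtle", "rdf+xml"))
            return is_rdf, f"{resp.status} {content_type}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout, and similar.
        return False, f"unreachable: {err}"
```

Calling `check_dereferenceable` on each of the three URIs listed above would report either the returned content type or the HTTP error code; an HTML error page served with status 200 would still count as a failure because its content type is not RDF.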

MINOR COMMENTS:
* Various typos and poorly constructed sentences. Please proof-read more thoroughly.
* Does the vocabulary really need to introduce more equivalent classes like City? Why not just use the remote equivalent class directly? (Though I see the arguments in terms of stable namespacing, the proliferation of equivalent terms worries me more.)
* Fix the spacing in Table 1.
* owl:Ontology? (Again.)
* Should note explicitly that Listing 1 is a SPARQL 1.1 query.
* As far as I can see, that query can be represented without needing sub-queries (you don't use any solution modifiers and only need one "level" of aggregates that can be done in the main query). This makes the query much more straightforward: e.g., you don't need the equality check between ?ftscountry and ?dbpcountry, you can just use one. Finally, I don't see how the query will even work as stated without GROUP BY on country.
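
The point about needing only one level of aggregation can be pictured with a plain-Python analogy. This is not the query from the paper and not SPARQL; the function name and the input shapes are hypothetical and only mirror what a single GROUP BY on country would compute.

```python
from collections import defaultdict


def funding_per_gdp(commitments, gdp_by_country):
    """Sum funding per country in one grouped pass, then divide by GDP.

    commitments: iterable of (country, amount) pairs.
    gdp_by_country: mapping from country to GDP in the same currency unit.
    """
    totals = defaultdict(float)
    for country, amount in commitments:   # plays the role of GROUP BY ?country
        totals[country] += amount         # plays the role of SUM(?amount)
    # One shared country key joins both sources, so no equality check
    # between two separate country variables is needed.
    return {
        country: total / gdp_by_country[country]
        for country, total in totals.items()
        if country in gdp_by_country
    }
```

Because the grouping key is shared by both inputs, the aggregate and the division happen in one pass with no nested step.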

Solicited review by Emanuele Della Valle:

The paper presents in a fairly-written format a valuable dataset about Europe's Television. The dataset is of medium size and it is externally connected to DBpedia. Hereafter, I list my major and minor comments.

= Major Comments =
- the authors should clarify the final quality of the dataset that they are describing. They state that they manually assessed that DBpedia Spotlight provided them with low-quality links to DBpedia, but it remains unclear whether they deleted the wrong links from the dataset
- the authors should add a description of the EBUCore vocabulary and possibly show it in an image; this will allow readers to issue SPARQL queries against your endpoint
- the authors may want to describe a few example queries that show the relevance of the dataset for other users
- if necessary, the authors can cut the project description (Section 2) in half and drastically reduce the description of the MINT tool. The special issue calls for dataset descriptions.
- the authors may also want to comment on the licensing. Why do they introduce their own licenses? Why not use Open Data Commons?
- the URLs of the SPARQL endpoints on page 5 column 2 appear rather unstable. I recommend using a reverse proxy and exposing them at least through the project web site. I also suggest that the authors check the possibility of donating the dataset to the European Union Open Data Portal.

= Minor Comments =
- is a VoID description of the dataset available?
- the link to googledocs (page 4) should be replaced at least by a technical report published on the project website
- add references
  - page 2 column 1
    - project's survey and reports
    - MPEG7
    - DCMI
    - TV Anytime
  - page 3 column 1
    - EUscreen has written a set of guidelines
    - OAI-PMH
  - page 5 column 1
    - DBpedia
    - DBpedia Spotlight
- typos
  - page 3 column 1
    - the URL of MINT is redundant since it appears also on page 2 column 1
  - page 4 column 1
    - At this point -> At this point,
    - (i.e. -> (i.e.,
    - (For example -> For Example
    - item.) -> item.
    - EBU Core classes -> EBUCore classes
  - page 4 column 2
    - (i.e. -> (i.e.,
    - (e.g. -> (e.g., *** two times ***
    - to gen language -> to generate language
  - page 5 column 2
    - info from google anytics -> info from google analytics
- format references according to the journal style

Solicited review by Jerome Euzenat:

The presented data set concerns one primary source: EC-funded projects as published by the commission.
This seems to be a good source because it contains data not published elsewhere. Moreover, the EC website is not easy to manipulate (too many required fields in the query form?). So the best thing to do is to retrieve the CSV.

The paper gives a useful overview of the encoding design. This should be published as documentation together with the data set since it allows eventual users to understand the dataset. However, I personally would have preferred a more complete paper instead of the description of the process used for creating the dataset.

Unfortunately, when I went to test this dataset, although there seems to be a Virtuoso and an OntoWiki there, it triggered too many errors to be usable.
I have been unable to correctly test the data set. It is sometimes not working. The OntoWiki system seems to be dependent on cookies but does not detect their availability.
The vocabulary (http://fts.publicdata.eu/o/) is not dereferenceable. In fact, the "o" should be replaced by "ontology".
Then dereferencing on the web is still puzzling: For instance, the commitment concept seems to have no instance, the generated source is empty, I tried to negotiate xml and got html instead.

So, it seems that the dataset itself is not yet in a state that makes it worth publishing a paper about it. In fact, the paper is more "usable" than the current way the dataset is made available.

I agree with the presented points that the described setup, both ontowiki and virtuoso-enabled SPARQL-endpoint would offer a more usable dataset than the original commission portal, both for simple users and developers. But at the moment, this does not seem to be the case.

The example SPARQL 1.1 query and proposed use is a really good illustration.

The related work part is a bit difficult to read because it jumps from generalities (TBL and data.gov.uk) to the very particulars of this project and this group's approach (CSV, Stat2RDF, etc.). It would be better to stay in the middle.

The paper in itself is not tremendously exciting, but this is not the purpose of such a paper. It can do a good job at advertising the dataset and the advantages of linked data over the commission portal.

Concerning the specific question raised by the call:
* Quality of the dataset
The authors show how they improve this so it seems better than what is produced by the source.

* Usefulness (or potential usefulness) of the dataset
The dataset is potentially very useful for generating various statistics, an example of which is given in the paper. Unfortunately, at the moment it does not seem to be very usable.

* Clarity and completeness of the descriptions
The paper is quite clear. However, it describes the process of generating the dataset more than the dataset itself. This may be useful, but, in terms of completeness, it would be more useful to find what the main categories are, how they are used, and what the linked vocabularies are.

Maybe a list of data sources to which this is linked and a presentation of the URI used would be useful.

Note that my recommendation of a major revision certainly applies to the dataset as well. It is also based more on the three criteria above than on the usual journal criteria.

p2: spatial relations -> spatial coordinates?
p4: lables -> labels
