World War 1 as Linked Open Data

Tracking #: 716-1926

Authors: 
Eetu Mäkelä
Juha Törnroos
Thea Lindquist
Eero Hyvonen

Responsible editor: 
Jens Lehmann

Submission type: 
Dataset Description
Abstract: 
The WW1LOD dataset is primarily a reference dataset meant to link collections dealing with the First World War. For this purpose, the dataset gathers events, places, and agents related to the war from various authoritative sources. Additional information on the entities is also recorded, in order to be able to answer more complex questions relating to them. The approach is being evaluated using an existing WW1 online collection. In addition, discussions are ongoing with several other organizations about making use of the dataset.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Aidan Hogan submitted on 24/Jul/2014
Suggestion:
Major Revision
Review Comment:

In terms of the official criteria for the 'Data Description' track, I can say the following:

1) Quality of the dataset:

The authors have obviously put a lot of thought into ensuring the high quality of the content of the dataset by only considering authoritative historical sources; a lot of the discussion in the paper refers to where the dataset is sourced from and why those sources were chosen. Likewise, according to the authors, links generated to other datasets have been verified by domain experts.

In terms of the technical aspects of the dataset, these have been much improved in that a lightweight vocabulary has been made dereferenceable, URIs appear valid and standard vocabularies are now being used more often. However, I still notice quite a few bugs in the system.

For example, when I try to access:

http://demo.seco.tkk.fi/ssaha/project/resource.shtml?model=ww1lod&uri=ht...

I get an exception embedded at the end of the data:

( Expression p.valueTypeLabel is undefined on line 48, column 44 in saha3/resource.ftl. The problematic instruction: ---------- ==> ${p.valueTypeLabel} [on line 48, column 42 in ...)

When I tried to access the link:

http://ldf.fi/ww1lod/main/

as given in the paper, I get an authorisation request and a subsequent 401. This suggests that the system is still in a prototype stage.

2) Usefulness:

In general, it is difficult for me to assess usefulness since I am not an expert on WW1 and I am not the target audience for the dataset. I don't believe I would have reason to use it myself.

However, more generally, what I can say is that the content of the dataset appears to be of high quality and manually curated from authoritative sources. The drawback of this approach is that the dataset is quite small: I count around ~3k entities, ~40k triples, and ~300 links to other datasets. Moreover, whatever about the size of the dataset, the scope of the dataset in the context of WW1 seems quite limited. A couple of hundred key events have been annotated, but the focus of the dataset is on the atrocities in Belgium. The authors may argue that this is a seed for further contributions, but since the paper was first submitted a year ago, I am unsure if any major advancements have been made in broadening the scope of the dataset in any significant way.

Given that the dataset is rather specialised, I think it really targets domain experts in the area. As such, interfaces and tooling are important. And there are some quite nice demonstrators, such as the annotated reader, the Linked Data browser, and the SPARQL editor assistant. But there are parts of the system that seem unusable. For example, if I click this link:

http://demo.seco.tkk.fi/ssaha/project/resource.shtml?uri=http%3A%2F%2Fld...

I get a bunch of triples in the form of a predicate–object list, where both predicates and objects are given using full URIs. This is not human-readable at all and I doubt domain experts would find this intuitive (I certainly didn't). For example, the time-span of the event in question is presented as:

http://www.cidoc-crm.org/cidoc-crm/P4_has_time-span | http://ldf.fi/ww1lod/95ed7607

This is unreadable; I still don't know what the time-span was. Furthermore, if I click on the object URI in the interface, I get no information about the time-span, just a bunch of references. However, if I access the system through http://ldf.fi/ww1lod/, I get a much cleaner interface with maps and more readable values.
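As a rough illustration of the readability point, the following minimal sketch shows how a predicate–object row could be rendered with prefixed names, or with labels where they are available. The prefix abbreviations `crm` and `ww1lod`, the function names, and the optional `labels` mapping are assumptions made for this example; they are not taken from the system under review.

```python
# Hypothetical prefix map: the abbreviations are illustrative assumptions.
PREFIXES = {
    "crm": "http://www.cidoc-crm.org/cidoc-crm/",
    "ww1lod": "http://ldf.fi/ww1lod/",
}


def compact(uri, prefixes=PREFIXES):
    """Return 'prefix:local' when a known namespace matches, else the URI unchanged."""
    best = None
    for prefix, namespace in prefixes.items():
        # Prefer the longest matching namespace so nested namespaces resolve correctly.
        if uri.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is None:
        return uri
    return f"{best[0]}:{uri[len(best[1]):]}"


def render(predicate, obj, labels=None):
    """Render one row, using a human-readable label when one is supplied."""
    labels = labels or {}
    left = labels.get(predicate, compact(predicate))
    right = labels.get(obj, compact(obj))
    return f"{left} | {right}"


row = render(
    "http://www.cidoc-crm.org/cidoc-crm/P4_has_time-span",
    "http://ldf.fi/ww1lod/95ed7607",
)
print(row)
```

Prefixed names only shorten the row. To show the actual time-span, the interface would still need to fetch a label or the begin and end dates for the object resource and pass them in through something like the `labels` mapping.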

As such (and together with the prior technical issues), I question whether or not the system is still in a prototyping phase.

3) Clarity and completeness of the descriptions:

The paper is quite well-written and does a good job of motivating the work. However, the paper relies too much on the homepage, where an overview of the dataset and its statistics are provided, as well as example queries that can be run against the SPARQL endpoint. In general, I think the authors should present a lot more of this information in the paper itself: at the moment, I found the homepage probably more informative than the paper itself. Instead, insofar as possible, the paper should stand alone as an independent contribution. As such, I think that the authors should present and discuss some example queries in the paper, as well as some of the more interesting statistics given in the VoID description.

In summary, my concerns about the size and the scope of the dataset and the prototypical nature of some of the systems remain; likewise, I am concerned about the completeness of the article as a standalone description.

In terms of what can be fixed for a revision:

* Fix the technical problems with the system as outlined above.
* Provide example queries and important high-level statistics in the paper itself, rather than pointing to the homepage for more information.

This then just leaves my concerns about the size and the scope of the dataset; if the authors address the above issues I would be willing to accept the paper for publication despite the limited scope of the dataset. However, the editor or other reviewers may see differently.

Also, some of my previous minor comments were, unfortunately, not addressed:

MINOR COMMENTS:
* "user needs research" -> "user-needs research"
* "academic sources. [13,17]" -> "academic sources [13,17]."
* "Further, the thesaurus" -> "Furthermore, the thesaurus"
* "1914-1918" Again, use en-dash for intervals. Though most cases have been fixed, some still remain.
* "the WW1LOD vocabulary" Fix the bad box.
* "In these integrations however, a problem that appears is that ..." Poorly written. Rephrase.
* Numeric columns in tables should be right aligned.

Review #2
By Michael Martin submitted on 14/Nov/2014
Suggestion:
Major Revision
Review Comment:

Motivated by the recent centenary of the outbreak of the first world war and the historical interest in it, the authors present a dataset containing events, places, and agents related to the first world war. The dataset is published through a SPARQL endpoint.
Compared with the initial version of the paper, this version contains a set of changes that have improved its quality considerably.
The remarks below follow the dimensions defined in the call.

== Usefulness (or potential usefulness) of the dataset ==
The ww1lod dataset is intended to be a reference dataset for linking various publications and collections related to the first world war.
As a reference dataset it can help students and researchers to locate documents and data related to a given topic and in the other direction to help readers to better understand the context of places, events, and people mentioned in a given document.
Especially it can help the reader to understand ambiguities or consolidate variations in names of persons and places among different documents.

== Clarity and completeness of the descriptions ==
Overall, the paper is well written but has to be enhanced on a few points.
As described in the next section, the dataset itself, its concepts, and the description of the process undertaken to create the dataset are not in a final state.
On the criterion "completeness of the dataset", the dataset can be rated as rather incomplete, or better: as a reader I am not sure about its completeness.
It would be very helpful to know, for instance, approximately how many events happened during the first world war, and how that compares to the number included in the dataset.
The paper only contains the information that 689 events are addressed in the dataset.

== Quality of the Dataset ==
The dataset is built with data from various sources which are judged as authoritative and quality-controlled by domain experts.

The authors state that they provide a reference dataset for events related to the first world war.
A focus is especially put on the events and places around Belgium while the paper leaves the impression that the data is quite sparse in the general aspects of the war besides a base dataset provided by the Imperial War Museum (IWM).
For a reference dataset one would expect a wider evaluation of its completeness.

=== Name, URL, versioning, licensing, availability, source for the data ===
The WW1LOD dataset is published in the http://ldf.fi/ww1lod/ namespace and licensed under the terms of Creative Commons license: CC-BY-SA 4.0.
The dataset can be accessed through a SPARQL endpoint under http://ldf.fi/ww1lod/sparql and the individual resources are accessible according to the Linked Data guidelines. Information about versioning is not given and a complete dump of the dataset is not provided. The paper states that the dataset is regularly updated, enriched, and maintained by researchers. A VoID description of the dataset can be found under the URI (this information would be helpful if available in the paper).

=== Purpose of the Linked Dataset, e.g. demonstrated by relevant queries or inferences over it ===
The purpose of the dataset is demonstrated using an example application and four SPARQL queries, for which the authors refer to the project homepage. The four queries give answers to the following questions:
- Works in the CU WW1 Collection related to themes that are also related to General Friedrich Adolf Julius von Bernhardi
- Items from Europeana, the CU WW1 Collection and Out of the Trenches relating to events that happened in West Flanders
- Units of the German 3rd Army ordered by the count of atrocities they participated in Belgium
- Population change in Belgian provinces during the war years as compared to the number of atrocities as well as total events that occurred there

=== Applications using the dataset and other metrics of use ===
For the demonstration of the dataset, a contextual reader application was implemented. Unfortunately, it seems to mainly connect DBpedia instances instead of instances from the ww1lod dataset, and the annotations contain some mistakes, e.g. the phrase “400,000 men” is annotated with “http://dbpedia.org/resource/400”, which describes the year 400. By contrast, the annotation of e.g. “deportations” is very useful, showing a timeline and a map of war events.
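The “400,000 men” mistake is the kind of error a simple guard in an annotator could catch. The following is a minimal sketch under stated assumptions: the function name and the character-offset interface are hypothetical, and nothing here describes how the contextual reader actually performs its matching.

```python
def is_partial_number(text, start, end):
    """Return True when text[start:end] is only a fragment of a longer numeric token.

    A candidate match is treated as a fragment if a digit, or a thousands or
    decimal separator followed by a digit, sits directly next to it.
    """
    before = text[start - 1] if start > 0 else ""
    after = text[end:end + 2]

    # A digit directly before or after the match.
    if before.isdigit() or after[:1].isdigit():
        return True
    # A separator between a preceding digit and the match, e.g. "1,400".
    if before != "" and before in ",." and start > 1 and text[start - 2].isdigit():
        return True
    # A separator between the match and a following digit, e.g. "400,000".
    if len(after) == 2 and after[0] in ",." and after[1].isdigit():
        return True
    return False


phrase = "400,000 men"
offset = phrase.find("400")
# The candidate "400" is part of "400,000", so it should not be linked to the year 400.
print(is_partial_number(phrase, offset, offset + 3))
```

A check like this would only suppress one class of false positives; it does not address the broader observation that the reader links mostly to DBpedia rather than to ww1lod resources.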
For a reference dataset one would expect some external applications to use the dataset or some applications which are more than just a pure demonstration. At least it can be outlined, which external applications are able to use the dataset.

=== Creation, maintenance and update mechanisms as well as policies to ensure sustainability and stability ===
The creation process of the dataset was sketched and the contributors were named. A description of the maintenance and update mechanisms is missing and would be helpful to include in the paper.

=== Quality, quantity and purpose of links to other datasets ===
According to the paper the dataset is linked to DBpedia with 152 links, to the “Out of the Trenches”-dataset with 29 links and to GeoNames with 1248 skos:closeMatch links.

=== Domain modeling and use of established vocabularies ===
The paper gives a coarse overview of the core data model in figure 1 but states that it actually uses the CIDOC-CRM (International Committee on Documentation – Conceptual Reference Model) RDFS schema. The relation between the two data models is not clearly described in the paper.
Besides the CIDOC-CRM RDFS schema, according to the paper the following vocabularies were used: RELATIONSHIP, FOAF, and schema.org for modeling actors.
Unfortunately the given void:exampleResource -instances ( void:exampleResource ?i) are not very verbose:
e.g. one can't say what the triple should tell, since dereferencing doesn't result in any information resource (neither RDF nor HTML) and the triple store doesn't contain any triple matching ?p ?o . Further the person is not involved in any event.

=== Examples and critical discussion of typical knowledge modeling patterns used ===
A few discussions (but not critical ones) are included such as modeling places/locations and their temporal relations.
The authors note that this is a difficult topic, especially in the domain of war events.
Having a look at the dataset one can see the following:
For describing persons the class is used.
rdfs:subClassOf
while according to the triple store “[…] comprises people, […], who have the potential to perform intentional actions for which they can be held responsible.” and it is stated to be owl:equivalentClass of foaf:Agent.
Further the description states that owl:equivalentClass foaf:Person. In the owl recommendation document one can read “both class extensions contain exactly the same set of individuals” while adding the concept of “responsibility” certainly narrows down the extension opposed to the intention described for foaf:Agent and foaf:Person.
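The objection about owl:equivalentClass can be made concrete with a toy model in which a class extension is simply a set of individuals. This is a sketch only: the individuals are hypothetical, and the claim that the "responsibility" condition excludes some agents is the reading taken in this review, not something asserted by the dataset.

```python
# Hypothetical individuals, purely illustrative.
foaf_agent = {"person_a", "organization_b", "software_bot_c"}

# Assumption for the example: only some agents can be held responsible.
can_be_held_responsible = {"person_a", "organization_b"}

# An actor class defined with the added "responsibility" condition.
narrowed_actor = foaf_agent & can_be_held_responsible


def equivalent(ext_a, ext_b):
    """Equivalence in this toy model: both extensions contain exactly the same individuals."""
    return ext_a == ext_b


def sub_class_of(ext_a, ext_b):
    """Subclass in this toy model: every individual of ext_a is also in ext_b."""
    return ext_a <= ext_b


print(equivalent(narrowed_actor, foaf_agent))    # False
print(sub_class_of(narrowed_actor, foaf_agent))  # True
```

In this toy setting the narrowed class is contained in foaf:Agent but does not have the same extension, which is the mismatch described above between the textual definition and the stated equivalence.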

=== Known shortcomings of the dataset ===
Shortcomings are not explicitly described in the paper but, as written in section "7 Discussion and Future Work", the creation and enrichment of the dataset is still ongoing.
The dataset is not yet interlinked with resources from the Europeana project, even though the authors point this out as a potential improvement.

== Minor remarks ==
Section 2 last paragraph (page 2): inconsistent spelling of project name “WWI LOD” vs “WW1LOD”
Section 3.1 first paragraph (page 4): “Deutsche Verbände und Truppe” -> “Deutsche Verbände und Truppe[n]”
Section 5 4th paragraph (page 6): Why is the fact that the dump is produced by a SPARQL query a reason for not evaluating a dataset?
Section 6 4th paragraph (page 7): “WW1LOD” runs out of the text-box
Section 6 7th paragraph (page 7): “Catcholic” -> “Catholic”

Review #3
By Francois Scharffe submitted on 16/Nov/2014
Suggestion:
Minor Revision
Review Comment:

This paper presents a dataset about the first world war. The dataset aims to provide reference resources and a common vocabulary to which other related datasets could link. The dataset is of modest size, concentrating on the quality of the data sources used in the building process. Considerable attention is given to the vocabularies, and wide reuse of existing authoritative vocabularies is made. Semi-automated linking to existing LOD datasets is performed whenever possible. The dataset is hosted on a linked-data server providing dereferenceable URIs and a SPARQL endpoint.
The authors mentioned that other datasets were in the process of being included at the time of publication. It would be good to update the paper if these datasets have actually been integrated.