DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia

Tracking #: 499-1697

Authors: 
Jens Lehmann
Robert Isele
Max Jakob
Anja Jentzsch
Dimitris Kontokostas
Pablo N. Mendes
Sebastian Hellmann
Mohamed Morsey
Patrick van Kleef
Sören Auer
Christian Bizer

Responsible editor: 
Krzysztof Janowicz

Submission type: 
Tool/System Report
Abstract: 
The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available using Semantic Web and Linked Data standards. The extracted knowledge, comprising more than 1.8 billion facts, is structured according to an ontology maintained by the community. The knowledge is obtained from different Wikipedia language editions, thus covering more than 100 languages, and mapped to the community ontology. The resulting data sets are linked to more than 30 other data sets in the Linked Open Data (LOD) cloud. The DBpedia project was started in 2006 and has meanwhile attracted large interest in research and practice. Being a central part of the LOD cloud, it serves as a connection hub for other data sets. For the research community, DBpedia provides a testbed serving real world data spanning many domains and languages. Due to the continuous growth of Wikipedia, DBpedia also provides an increasing added value for data acquisition, re-use and integration tasks within organisations. In this system report, we give an overview of the DBpedia community project, including its architecture, technical implementation, maintenance, internationalisation, and usage statistics, and showcase some popular DBpedia applications.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Aidan Hogan submitted on 12/Jul/2013
Suggestion:
Accept
Review Comment:

In a sentence, this paper is essentially "everything you wanted to know about DBpedia but were afraid to ask". The authors discuss the framework used to extract RDF from Wikipedia, the various types of extractors used, the composition and maintenance of the DBpedia ontology, statistics on the number of mappings to the ontology, statistics on the "hit rate" of mappings, statistics on the sizes of various localised versions of DBpedia, methods for synchronising DBpedia with live changes occurring in Wikipedia, the number and targets of outgoing links from DBpedia, estimates of incoming links to DBpedia, the number of hits the site's services observe, etc. The paper also gives an overview of other research/commercial/community efforts that have used DBpedia, and of related systems that build structured encyclopaedic knowledge bases or extract structured data from Wikipedia. The paper wraps up with future plans and directions for the DBpedia project.

In terms of the formal review criteria for tool/system report:

(1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided).

The DBpedia system/project itself involves a huge combined effort and the work described is (very obviously) of high impact. The described tool/system is free, open, and accessible on the Web.

(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

DBpedia has been published before in a 2009 journal paper, but given the raft of developments in the past few years, I believe that there is more than enough novelty to justify an updated tool/system report. The paper is generally well written and well structured. It provides a lot of detail on various aspects of the DBpedia pipeline and resulting KB that should be of relevance to many practitioners.

As such, I commend the authors on a good job and recommend an accept. However, there are a number of typos, suggested clarifications and nit-picks that should be looked into, as follows.

== MINOR COMMENTS ==

# THROUGHOUT:
* Mixing UK and US English. "organisations", "center", etc. Make consistent.
* I think table captions go on top for the SWJ style?
* Fix obvious bad-boxes.
* Fix bad line-spacing, as per Section 2.4 (right column, bottom, p5).
* Some figures (esp. figure 6 & 14, also 9 & 13) won't work well in black & white.
* Prefixes are (re-)introduced in various places and sometimes after they are first used. Perhaps just create a prefix table to enumerate all prefixes?
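Such a prefix table could simply collect the namespaces already used in the paper (dbr, dbo, foaf, rdfs, etc.). As an illustration only — the namespace URIs below are the conventional ones and are assumed, not taken from the paper:

```turtle
@prefix dbr:  <http://dbpedia.org/resource/> .
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
```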

# TITLE:
* Use a dash, not a hyphen.

# SECTION 1
* "for the success of the Linked Open Data initiative" Success somehow implies to me that it is all over and done and we can all go home but I don't see that as the case.

# SECTION 2
* "comprise [of] various"?
* "infobox data is useful[] if a"
* Table 1: dbr:Albedo, supposed to be symmetric?
* Table 1: some triples have full stops, other not.
* Table 1: Artificial[_]Languages?
* dbr:Ford_GT40: For both examples, the result is not RDF. It seems that the outer square brackets are redundant.
* "dbo:[]populationTotal"
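To illustrate the square-bracket point: with the outer brackets dropped, the extracted output would read as valid Turtle along these lines (the property choices and values here are hypothetical, loosely adapted from the paper's Ford GT40 example):

```turtle
@prefix dbr:  <http://dbpedia.org/resource/> .
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

dbr:Ford_GT40 foaf:name "Ford GT40"@en ;
              dbo:displacement 4181 .
```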

# SECTION 3
* Fig 4: Would be interesting to know why classes drop in 3.7 -> 3.8.
* "Table 3 reports the number of instances for a set of selected ontology classes" How were these selected?
* "20 language version[s]"

# SECTION 5
* "Linksets, which are added to the repository[,] are used for the ..." OR "Linksets that are added to the repository are used for the ..." [In this case, they mean different things.]
* The SPARQL query of Section 5.1: is there an endpoint against which this can be run?
* "on [an] instance-level"
* The description of links for Sindice refers to "see appendix for details" although no appendix exists. I would also like to see those details since, for example, purl.org is the top most linked dataset but contains no data (rather being a redirect hub). Similarly, I could expect w3.org to be highly linked since RDF-namespace URIs, which appear in nearly every RDF document, could be considered links to that domain.

# SECTION 6
* "Virtuoso supports infinite horizontal scale-out" The word 'infinite' is perhaps a bit of a hyperbole, no? Not to mention physically impossible.
* In the discussion of the SPARQL endpoint (near the discussion of 509s), I think it would be worth mentioning the overall results for the availability of the DBpedia SPARQL endpoint from http://labs.mondeca.com/sparqlEndpointsStatus/.
* In Section 6.3, I'd like a notion of how many unique IPs are involved. For example, visits are expressed over a 30 minute window. I'd thus be interested to know if the unique IPs would be close to 48x the presented figures (indicating the same IPs accessing the services frequently), or 1x the presented figures (indicating different IPs each time). Adding a column to table 10 for AvgIP/Day would be great.
* Just to clarify, the /data /page and /resource services on DBpedia don't redirect requests to SPARQL? That is to ask, the dereferencing requests are not part of the SPARQL logs?
* Figure 11/12/13. It is not necessary to include the title of the plot, esp. if the caption states the same thing.
* "all the PUT requests" Is PUT intended? Not POST?
* Figure 14 caption: "syncronization" (sp)

# SECTION 7:
* "It can not only improve ... but also ..." A bit awkward. Rephrase.
* "answer questions over DBpedia correctly[,]"
* Though readable, the writing of Section 7.4 should be improved. The paragraph is too long and the flow is poor. It starts off by talking about Wiktionary at length without offering a context as to why Wiktionary is relevant to DBpedia. Some of the sentences are awkwardly phrased (esp., "Existing tools can be ..." "Opposed to ...").

# SECTION 8:
* "Wikidata does not aim at offering the truth about things" I would disagree. They aim to offer the truth about claims like 'The population of Berlin was 3.5 million in 2011 according to X'. Their claims just happen to be more detailed.
* Footnote 41: Add http:// for consistency.
* "One of the projects[ that] pursues"
* "type systems: [w]hereas"
* "YAGO contains [many] more"
* "[2] was an infobox extraction approach". Preferably use a system name or an author name or "we" as the subject, not a reference number.

# SECTION 9:
* "in the last years in particular also in terms of" Rephrase.
* "we can reach []significantly"
* "is not only capable to create the" -> "is capable not only of creating the"
* "we were going a first step" -> "we were making a first step" (or "are" perhaps)
* "Both projects complement each other nicely." The collaboration between Wikidata and DBpedia seems to be more one-way than this claim suggests. If I understand correctly, Wikidata requires all data/schema to be input manually, meaning that they effectively cannot reuse anything from DBpedia? (As stated, DBpedia could make use of Wikidata, but not vice-versa?)

# REFERENCES:
* Check capitalisations of, e.g., "rdf".

Review #2
By Prateek Jain submitted on 05/Aug/2013
Suggestion:
Major Revision
Review Comment:

'DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia' provides an overview of and update on the DBpedia project. The paper gives details about the DBpedia framework, its implementation, interlinking on the LOD cloud, and the applications which use the DBpedia dataset in one form or another. The work is an update of the two previous DBpedia papers and covers two major new developments: (1) the DBpedia Live framework and (2) the different language mappings. The DBpedia dataset and project are very well known within the community, and this work will provide a valuable update for researchers and practitioners interested in it.

My main comments on the paper are as follows:

1. Can the authors give some background regarding the motivation behind the parsing approach in Section 2.1? More specifically, why is an AST being used for the task? Is there any specific advantage obtained by using it in this specific setting? (Please note: this is a question, not a criticism of the work in any way.)

2. The work presents a whole bunch of interesting statistics related to DBpedia. Some of these statistics change frequently, e.g. the language mappings and the SPARQL query log data. Is it possible to have a webpage on the DBpedia project website that allows them to be updated periodically? This would allow fresh numbers to be used, and visitors could be asked to cite this work.

3. Are there measurements such as precision and recall for the different DBpedia extractors? These numbers are useful for people interested in using this data, as well as for other researchers in the NLP field who might want to do a comparative evaluation of their own extractors. How much have these extractors improved since their inception?

4. Minor comment: On page 5, is it possible to include the end date in the extracted data, i.e. 1969, just for the sake of making the example complete?

5. How good are the mapping validators (page 6)? Can the authors provide some measurement of precision and recall, or point to other publications which give these numbers? Are there any corner cases?

6. On page 7, there is a mention of entities which are included in the regional versions of Wikipedia but for which the English-language Wikipedia has no related entries. The way the authors handle this seems fine to me. However, it makes me wonder whether there should be a process for adding these entities to the English-language edition. (I understand the authors have no control over this and it is not a shortcoming of their work, but since they have worked with Wikipedia for a while, what do they think about this?)

7. I am curious why there is little or no use/description of any kind of reasoning algorithm to infer interesting/contradictory facts within the DBpedia framework. Is this work in progress, or is there some kind of scalability issue?

8. What is the meaning of the "Official DBpedia Chapter"? What does it imply? Who grants the status of being official?

9. Is it possible for the authors to split the applications of DBpedia into commercial and research subsections? This would allow others to easily identify and utilize them for various surveys, such as on commercial use of LOD datasets.

10. There are typos and grammatical errors, for example in the introduction:
tools have been build -> tools have been built
Section 2.4, in 2010 a community effort has been initiated -> effort was initiated?

Overall, I like the work and I think it will be a great article for the journal and the community. I look forward to reading the revised version of the manuscript.

Review #3
By Aba-Sah Dadzie submitted on 17/Aug/2013
Suggestion:
Minor Revision
Review Comment:

Written by key project members, the paper provides a detailed report of the evolution of DBpedia, with a focus on new features and services, and updates since the last two reports (published in 2008 and 2009). The report concludes with examples of use and potential avenues for further application. It also presents a comparison with other data extraction projects, highlighting similarity in aim, if not always in features provided or supported, and showing where they and DBpedia complement each other. The roles played and the contributions made by the wider DBpedia community are also described.

The paper is written more like a white paper than the typical system report. It is therefore also much longer; some of the descriptions could be condensed a bit. Importantly, it does not report explicit evaluation; however, utility may be inferred from usage, both by third parties and by project members.

As the authors themselves note, the relevance of the DBpedia community project to the Semantic Web, NLP and IR communities, among others, is evident in reuse, and beyond the references in this report. The paper should serve as a useful reference for continued use and further (community) development of the resource.

One thing makes the report a bit difficult to follow: several references are made to points in DBpedia history, either using (numerical) publication references or version numbers. While the authors of these papers and the creators of DBpedia can easily map these to points in time, the average reader cannot, and so ends up wasting time looking them up. It would be useful to provide a lookup table early in the paper mapping version numbers to release dates, and also, where references ask the reader to judge based on time, to state these explicitly (as year or quarter/year as appropriate).
Similarly, the expression "last/past years" is used several times with respect to the evolution of DBpedia. This is simply too vague; at the least it should specify "last few" or "several" years. The reader should not need to keep checking to see what time span this probably covers.

DETAILED REVIEW

S1 - intro - "This system report is a comprehensive update and extension of previous project descriptions in [1] and [5]. The main novelties compared to these articles are: …" - "novelties" as used here is incorrect. I'd suggest "advances" or, simply, "new work".

What exactly is meant by "access" here: "…in Section 6, we provide statistics on the access of DBpedia"?

S2.1 - why is N-Triples in particular given as the only example of an output format?

A lookup table for the language codes would be useful - not all are easily guessed at, e.g., eu, tr, ar, hr; even Greek is not easily guessed as el - otherwise the reader needs to do extra work to confirm these.

S2.4 - "Mapping Validator: When editing a mapping, the mapping can be directly validated by a button on the edit page." - that the user clicks on a button is irrelevant - what matters is that functionality is available for triggering the validation.

S2.5 - "Recent DBpedia internationalisation developments showed that this approach resulted in less and redundant data [23]." - confusing - what does "less and redundant" mean? From the context I suspect "and" should be deleted?

S2.7 - "One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 2010 to improve the efficiency of the extractors by an order of magnitude compared to the previous PHP based framework." - what is the (value of the) order of magnitude? Otherwise the reader cannot gauge the value of the extra work done. Also, why Scala? I'm not saying it's not a good choice, just that drawing attention to it invites the question.

"They are important for specific extractors as well, for instance, the category hierarchy data set (SKOS) is produced from pages of the Category namespace. " - how is SKOS relevant here? - it is without doubt not the category hierarchy data set.

S3.1 - "It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage." - actually quite difficult to locate this - the bar in question should be highlighted with some annotation.

S3.2 What were the criteria for selecting the 20 language versions? Ditto for the 10 in Table 3.

S6.1 "To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter or 6 TB per month is currently needed." - redundant - both halves of the sentence say the same thing.

S6.2 - reads more as an advert for Virtuoso than a description of how it is used as a store for DBpedia.
It would be more useful to say what the status code 509 represents than to give a link to the definition of status codes on Wikipedia. Also, if anything, the pointer should be to the formal definitions at w3.org, not Wikipedia.

S6.4 "Figure 13 shows the percentages ... As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013." - actually, no, cannot see this - the authors may know the mapping from DBpedia version to release year, but the average reader will not. There are several other instances where such oblique references are made.

S7 - a lot of the examples of use are self-citations, I would suggest that reuse by the DBpedia team be noted - and maybe presented separately? - to demonstrate reusability beyond its creators/maintainers - this is what really strengthens the case for the value of DBpedia.

7.4 - is confusing - is this a positive example of use or one highlighting that applications are not always well implemented?

8.2 "Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49." - who are the authors being referred to - the creators of the thesaurus in the previous sentence? In which case - who are they? - both references are to a URL, not a publication.

Citing [2], the precursor of DBpedia, as related work is a bit unusual: it is not really related work, but more a previous incarnation of the system (or part of it).

The paper concludes by saying "Despite recent advances …, there is still a huge potential for employing Linked Data background knowledge in various NLP …" - this is contradictory - I suspect what is meant is something like "Recent advances show huge potential for ..."

The Sindice query in the appendix is so short as to be better placed within the paper.

FIGURES & TABLES

Fig. 1 is placed two pages after it is referenced, yet there is more than enough room to place it on the same page. Further, this would make it easier to interpret the corresponding text.

Fig. 2 - the text in the Greek mapping box is a bit confusing - why are some properties in Greek and some in English?

Fig. 4 - x-axis should specify release date - especially as the curve will look a bit different if they're not evenly spaced out.

Fig. 5 - would suggest (faint) background gridlines from the y-axis - it is difficult to map values to bars.

Fig 6 - release versions need to be shown with the dates

b - the colours for en & pt, and el & fr (tr is only slightly darker), cannot be distinguished even in colour, let alone in greyscale, which is how the paper will typically be printed for reading.

Figures 11 and 12 are related to each other (also cross-referenced together in the text) and should be placed next to each other, not on different pages; otherwise it is difficult to compare them. Also, while I acknowledge the captions state the charts are for SPARQL access, it would be useful to state this in the chart headers as well.

Convention places table captions ABOVE tables.

Table 1 - would be useful to indicate which of the four extraction types (in 2.2) each extractor is classified as.

Table 7 is referenced before 6 - should therefore come before it.

Table 15 - line count for XML config files is not very meaningful without a description of content

CITATIONS & REFERENCES

The intro forward references the DBpedia ontology - it would be useful to provide an exact cross-reference.

S7. In a lot of the "external" examples, a web link is given to the project or organisation using it. Where a citable publication exists, this should be used instead of, or in addition to, the URL; e.g., the BBC tag disambiguation example should reference a publication such as [1], and Watson has a few articles that cite DBpedia (among others) as a data/reference source.

Verify that capitalisation correct in all references, e.g., PowerAqua, not Poweraqua, in 28; GraphD, not graphd in 32; RDF, not rdf, in 41, DBpedia, not dbpedia in 45.

LANGUAGE & PRESENTATION

The Latin abbreviation cf. is used incorrectly in most cases - it means "compare (with)", but it is mostly used here where "refer to" or "see" is meant.

There is some overuse of commas, making reading a bit difficult - a comma should be placed only where there is a natural pause in reading. E.g., in 8.1.3, "One of the projects, which pursues similar goals to DBpedia is YAGO44 [39]." - commas would be appropriate only if it were written "One of the projects, YAGO, pursues similar goals to DBpedia [39]."

Mostly minor grammatical errors, easily caught with an automatic check and a proof-read. I'd recommend the latter be done by a single author - the paper reads quite uniformly for one with such a long author list. There are, however, a few areas with differences in writing style and correctness of language use.

S7.1.2 - "Due to its large schema and data size as well as its topic adversity, " - should this be "topic DIversity"?

Weird formatting at the bottom of pp. 3 and 5.

[1] Georgi Kobilarov, Tom Scott, Yves Raimond, Silver Oliver, Chris Sizemore, Michael Smethurst, Christian Bizer, Robert Lee: Media Meets Semantic Web - How the BBC Uses DBpedia and Linked Data to Make Connections. ESWC 2009:723-737

Review #4
By Oscar Corcho submitted on 24/Aug/2013
Suggestion:
Minor Revision
Review Comment:

This paper describes the current status of the DBpedia project in the form of a system paper, focusing on the following aspects: the current architecture of the extraction framework, the approach followed for the internationalization of the knowledge base, and the management of outgoing links. The rest of the paper focuses on describing some related work (other similar knowledge bases, such as those of Wikidata, Freebase and the Google Knowledge Graph) and applications that use DBpedia.

As a system paper, this manuscript is at a good level of detail in explaining the overall approach, though obviously, given the breadth of aspects it aims to describe, it does not go into too much detail on each specific aspect. However, the paper's content is worth reading for those interested in how DBpedia has been created and is being maintained.

I have the following minor comments over the contents of the paper, divided by sections:
- On the extraction framework the description is very clear.
-- However, I have some concerns about the examples. For instance, on page 5 it is not clear how the value 4181cc is transformed into 4181, or why for the property production only the first figure is extracted. Besides, how does the extractor know that the name is in English?
-- On page 6, it is not clear how the template property engine can be divided into displacement and powerOutput. Besides, how do you generate rdfs:label instead of foaf:name?
-- I would suggest naming the mapping validator the "mapping syntactic validator", as the current name could be misleading if the description is not read properly.
-- It would also be nice to have more details about the mapping tool. Having created mappings myself with the wiki-based editor, I consider this a very important tool to pay attention to, as it could make the mapping creation process much more efficient and less error-prone.
-- It would be useful to know which corner cases are referred to in section 2.5.
-- Something I miss overall in the description is some comment on the quality of the generated data. I know, for instance, of many quality problems in the generated mappings, due to the fact that much of the mapping editing is done by copying, pasting and editing a bit.
-- It would also be good to have an indication of how good the gender extractor described in section 2.6 is.
-- The extraction of data from tables should also be better explained, as that extractor is not included in the table of extractors presented earlier in section 2.

- On the ontology:
-- It would have been interesting, IMO, to have more details about how this ontology has evolved over time, especially in the context of the internationalization process. For instance, is the increase in properties due to some quality problem?

- The live synchronization section is well described and concise enough, and the same applies to the interlinking part. Here I do not think that the given SPARQL query adds much.

- On DBpedia usage, the results presented are quite exhaustive and very interesting. However, some parts may not be needed (for instance, the paragraph on ACLs does not seem too interesting). With respect to the queries, the analysis is very superficial and could be improved.

- The description of applications is obviously incomplete, given the large number of applications that rely on DBpedia, but it is representative of some of the most typical applications that can use this knowledge base. I would only propose shortening the Wiktionary part a bit, as it is too long.

- On the related work, I would probably propose removing this section, as it does not add that much in a system paper. The same applies to the appendix.

Typos:
- have been build --> have been built
- equivialent --> equivalent