Blue Brain Nexus: An open, secure, scalable system for knowledge graph management and data-driven science

Tracking #: 2974-4188

Authors: 
Mohameth François Sy
Bogdan Roman
Samuel Claude Kerrien
Didac Mendez Montero
Henry Genet
Wojciech Wajerowicz
Michaël Dupont
Ian Lavriushev
Julien Machon
Kenneth Pirman
Dhanesh Neela Mana
Natalia Stafeeva
Anna-Kristin Kaufmann
Huanxiang Lu
Jonathan Lurie
Pierre-Alexandre Fonta
Alejandra Garcia Rojas Martinez
Alexander D. Ulbrich
Carolina Lindqvist
Silvia Jimenez
David Rotenberg
Henry Markram
Sean Hill

Responsible editor: 
Stefan Schlobach

Submission type: 
Tool/System Report
Abstract: 
Modern data-driven science often consists of iterative cycles of data discovery, acquisition, preparation, analysis, model building and validation leading to knowledge discovery as well as dissemination at scale. The unique challenges of building and simulating the whole rodent brain in the Swiss EPFL Blue Brain Project (BBP) required a solution to managing large-scale highly heterogeneous data, and tracking their provenance to ensure quality, reproducibility and attribution throughout these iterative cycles. Here, we describe Blue Brain Nexus (BBN), an ecosystem of open source, domain agnostic, scalable, extensible data and knowledge graph management systems built by BBP to address these challenges. BBN builds on open standards and interoperable semantic web technologies to enable the creation and management of secure RDF-based knowledge graphs validated by W3C SHACL. BBN supports a spectrum of (meta)data modeling and representation formats including JSON and JSON-LD as well as more formally specified SHACL-based schemas enabling domain model-driven runtime API. With its streaming event-based architecture, BBN supports asynchronous building and maintenance of multiple extensible indexes to ensure high performance search capabilities and enable analytics. We present four use cases and applications of BBN to large-scale data integration and dissemination challenges in computational modeling, neuroscience, psychiatry and open linked data.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
Anonymous submitted on 24/Feb/2022
Suggestion:
Minor Revision
Review Comment:

This paper describes Blue Brain Nexus, an open source tool for knowledge graph management.
Overall, this is an impressive tool. I want to call out in particular the developer documentation, code availability, and reproducibility.
This rises to the level of a commercially supported open source project, including Docker images and tutorials.
The system itself shows both the need and the possibility of combining multiple software stacks in order to enable knowledge graph management.
I thought this was particularly insightful.

The paper shows impact with four deployments: the Blue Brain rodent brain project, the Human Brain Project, a Canadian centre for neuroinformatics, and Swiss research data management.

Overall, I think this is exactly the sort of tool that I would expect being featured in SWJ.

I did have some comments that I think would make the paper better.

In Section 2, the list of challenges is strong and really resonates, but many of them have already been discussed in the literature, whether data preparation challenges (see [1,2]) or compliance with metadata standards to facilitate reuse [3]. It would be good to add citations in this section for each of the challenges, as I think this would buttress your point that these challenges exist.

Furthermore, it would be good to extend the discussion of other knowledge graph management systems, for example WikiBase [4], KGTK [5], Open PHACTS [6], and Linked Data Hub [7], as well as the larger space of semantic data integration, for instance by referencing the survey in [8].

Lastly, it might be useful to comment on sustainability of the tool.

Minor comments

- It would be nice to cite W3C PROV.
- Some of the formatting on page 1 seems odd.
- It would be nice to provide a reference for the definition of knowledge graph (e.g. [9]).

[1] https://hpi.de/naumann/projects/data-preparation/data-preparation-biblio...
[2] Mazhar Hameed and Felix Naumann. 2020. Data Preparation: A Survey of Commercial Tools. SIGMOD Rec. 49, 3 (September 2020), 18–29. DOI:https://doi.org/10.1145/3444831.3444835
[3] Koesten, Laura, et al. "Dataset reuse: toward translating principles to practice." Patterns 1.8 (2020): 100136.
[4] https://wikiba.se
[5] Ilievski F. et al. (2020) KGTK: A Toolkit for Large Knowledge Graph Manipulation and Analysis. In: Pan J.Z. et al. (eds) The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_18
[6] Groth, Paul, et al. "API-centric linked data integration: The open PHACTS discovery platform case study." Journal of web semantics 29 (2014): 12-18.
Hogan, Aidan, et al. "Knowledge graphs." Synthesis Lectures on Data, Semantics, and Knowledge 12.2 (2021): 1-257.
[7] https://atomgraph.github.io/LinkedDataHub/
[8] Mountantonakis, Michalis, and Yannis Tzitzikas. "Large-scale semantic integration of linked data: A survey." ACM Computing Surveys (CSUR) 52.5 (2019): 1-40.
[9] Knowledge Graphs, Synthesis Lectures on Data, Semantics, and Knowledge, No. 22, 1–237, DOI: 10.2200/S01125ED1V01Y202109DSK022, Morgan & Claypool

Review #2
By Michel Dumontier submitted on 17/Apr/2022
Suggestion:
Accept
Review Comment:

(1) Quality, importance, and impact of the described tool or system
The described system is of very high quality, importance, and impact.
It is of high quality owing to its use of state-of-the-art technology, including a streaming event-based architecture with multiple indexes (Elasticsearch for document indexing and filtering; Blazegraph for RDF knowledge graphs and SPARQL query answering), which provides system robustness in case either is unavailable, and complete documentation of how the system implements the FAIR principles, including detailed provenance metadata of the objects therein and SHACL shapes for validation of RDF graphs.
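
To make the architectural point concrete, the following is a minimal sketch of an append-only event log feeding two independent index projections. It is not the Nexus implementation: the names EventLog, Projection, index_document and index_triples are hypothetical, a plain dict stands in for the document index and a set of tuples stands in for the triple store. It only illustrates why one index being temporarily unavailable need not block the other.

```python
class EventLog:
    """Append-only list of events; nothing is ever updated in place."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)


class Projection:
    """Builds one index from the log and remembers how far it has read."""

    def __init__(self, name, apply):
        self.name = name
        self.apply = apply      # function that writes one event into the index
        self.offset = 0         # position of the next unread event
        self.available = True   # simulates the backing store being reachable

    def catch_up(self, log):
        """Apply all unread events; return how many were processed."""
        if not self.available:
            return 0
        processed = 0
        while self.offset < len(log.events):
            self.apply(log.events[self.offset])
            self.offset += 1
            processed += 1
        return processed


documents = {}   # hypothetical stand-in for a document index
triples = set()  # hypothetical stand-in for a triple store


def index_document(event):
    # Keep only the latest revision of each resource.
    documents[event["id"]] = event["payload"]


def index_triples(event):
    # Replace the triples of any earlier revision of the same resource.
    for stale in [t for t in triples if t[0] == event["id"]]:
        triples.discard(stale)
    for predicate, value in event["payload"].items():
        triples.add((event["id"], predicate, value))


log = EventLog()
doc_projection = Projection("documents", index_document)
graph_projection = Projection("graph", index_triples)

# The graph store is down while two revisions of a resource are written.
graph_projection.available = False
log.append({"id": "neuron/1", "payload": {"type": "Neuron", "species": "rat"}})
log.append({"id": "neuron/1", "payload": {"type": "Neuron", "species": "mouse"}})

doc_projection.catch_up(log)             # the document index stays current
skipped = graph_projection.catch_up(log)  # nothing is applied while unavailable

# Once the graph store is back, it replays the log from its own offset.
graph_projection.available = True
replayed = graph_projection.catch_up(log)
```

Because each projection keeps its own offset into the same log, a further index could be added later and built by replaying the log from the start, which is the sense in which such indexes are extensible.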

It is of high importance as it proposes a new system to address the widely faced challenge of making and exploiting FAIR research data (and its lifecycle) in a manner that ensures quality, attribution, and reproducibility. Functional systems that are also open source remain lacking.

It has high impact as the system is already used in a number of high-profile projects, including the Human Brain Project, clinical and research environments at the Krembil Centre for Neuroinformatics, and the research data connectome. The role of the system in the context of the use cases is convincing.

(2) Clarity, illustration, and readability of the describing paper
The paper is clear, illustrated, and easy to read. The paper is mostly focused on describing the system architecture and convincing the reader of the value of the system. There is no meaningful discussion of its limitations, nor a roadmap for future work, nor how the project will be sustained past the current funding cycle.
The project is well documented on GitHub and in its read-the-docs documentation. The documentation appears sufficient to deploy, test, and extend the solution.

Review #3
By Joe Raad submitted on 02/May/2022
Suggestion:
Accept
Review Comment:

This paper presents Blue Brain Nexus, an ecosystem for data and knowledge management that is open-sourced by the Swiss brain research initiative Blue Brain.

Blue Brain Nexus is composed of three main components:

- Nexus Delta: composed of Cassandra (scalable NoSQL database), Blazegraph (scalable triple store) and ElasticSearch (scalable text search). These datastores are all exposed under a single secured API.

- Nexus Fusion: Web interface for Nexus Delta, allowing users to upload and search for data, and configure access permissions.

- Nexus Forge: Python framework that allows users to build knowledge graphs from various sources and formats, and to validate the data using SHACL.
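
As an illustration of what shape-based validation means in practice, here is a minimal sketch in plain Python. It does not use Nexus Forge or a SHACL engine: PropertyConstraint, validate and NEURON_SHAPE are hypothetical names, and the two checks only mimic the idea behind the SHACL constraints sh:minCount and sh:datatype on a JSON-like resource.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PropertyConstraint:
    """Simplified stand-in for a SHACL property shape (an assumption)."""
    path: str            # property name on the resource
    datatype: type       # expected Python type of each value
    min_count: int = 1   # minimum number of values required


def validate(resource: dict, constraints: List[PropertyConstraint]) -> List[str]:
    """Return a list of human-readable violations (empty if conformant)."""
    violations = []
    for constraint in constraints:
        raw = resource.get(constraint.path, [])
        values = raw if isinstance(raw, list) else [raw]
        if len(values) < constraint.min_count:
            violations.append(
                f"{constraint.path}: expected at least "
                f"{constraint.min_count} value(s), found {len(values)}"
            )
        for value in values:
            if not isinstance(value, constraint.datatype):
                violations.append(
                    f"{constraint.path}: {value!r} is not of type "
                    f"{constraint.datatype.__name__}"
                )
    return violations


# Hypothetical shape: a name is required, a soma diameter is optional.
NEURON_SHAPE = [
    PropertyConstraint("name", str),
    PropertyConstraint("somaDiameter", float, min_count=0),
]

valid = {"name": "L5 pyramidal cell", "somaDiameter": 18.5}
invalid = {"somaDiameter": "large"}  # missing name, wrong value type

valid_report = validate(valid, NEURON_SHAPE)
invalid_report = validate(invalid, NEURON_SHAPE)
```

A real SHACL validator works on RDF graphs and produces a structured validation report; the sketch only conveys the pattern of declaring constraints once and checking every incoming resource against them.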

The authors describe in detail the architecture of this system and its components, and provide several use cases in Section 7.

This is a system report submission, so it will be reviewed along the following two dimensions:

(1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided).

I think the project presented in this paper is exemplary in its approach to developing scalable data and knowledge management solutions, using open standards, for practical use in complex real-world scenarios. I believe that the approach adopted by the Blue Brain Nexus ecosystem is further evidence of the added value of these open standards and existing technologies, such as SPARQL and SHACL.

The value of Blue Brain Nexus is clear to the reader from the way the system is designed, and its importance is demonstrated by its adoption in several use cases described in Section 7.
-> My only remark is that it would have been valuable to also provide usage statistics in this section (e.g. number of active users, number of commits, number of queries per day, etc.) to clearly show how this system is being adopted in real-world scenarios.

The system is extremely well-documented on its website and GitHub repository, and includes a number of getting-started examples and code samples.

(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool

Overall, I think the paper is very well-written. It manages to provide developers with the large amount of technical detail required to fully understand the architecture and the performance of the system, while at the same time clarifying for the average researcher and user the important functionalities that this system covers and the problems it solves.
-> That being said, I believe that some detailed technical parts of this paper (mostly in Section 4) can be reduced and moved to the supplemental material or to other technical documentation that the authors can refer to in the paper.

The motivation and background section shows that the authors are fully aware of the challenges and the requirements of such systems.
-> Here, I would have hoped that the authors would mention a few other interdisciplinary projects that, similarly to Blue Brain, deal with the same challenges around managing the data and knowledge produced in data-driven science cycles. It would be great if the authors could list significant high-level differences in the approaches taken by different projects to solve these common problems, and discuss whether it is possible to reuse existing components and tools from other projects.

----

Finally, given the quality and the value of the introduced system, and given the clarity of the paper and the available external documentation of this system, I recommend the acceptance of this system report in the Semantic Web Journal.