Data journeys: explaining AI workflows through abstraction

Tracking #: 3407-4621

Enrico Daga
Paul Groth

Responsible editor: Guest Editors Ontologies in XAI

Submission type: Full Paper
Artificial intelligence systems are not built on single simple datasets or trained models. Instead, they are built using complex data science workflows involving multiple datasets, models, preparation scripts and algorithms. Given this complexity, understanding these AI systems requires explanations of their functioning at higher levels of abstraction. To tackle this problem, we focus on the extraction and representation of data journeys from these workflows. A data journey is a multi-layered semantic representation of data processing activity linked to data science code and assets. We propose an ontology to capture the essential elements of a data journey and an approach to extract such data journeys. Using a corpus of Python notebooks from Kaggle, we show that we are able to capture high-level semantic data flow that is more compact than using the code structure itself. Furthermore, we show that introducing an intermediate knowledge graph representation outperforms models that rely only on the code itself. Finally, we reflect on the challenges and opportunities presented by computational data journeys for explainable AI.

Solicited Reviews:
Review #1
Anonymous submitted on 10/Mar/2023
Review Comment:

This paper is a revised version of submission #3199-4413.

The revised version addresses my previous comments, in particular:
- the pseudocode of the proposed algorithms was simplified and a corresponding description was included;
- the authors presented the results of an exploratory user study whose aim is two-fold: i) evaluating the data journey ontology, and ii) evaluating concrete data journeys.

I am satisfied with the changes, thus I recommend acceptance.

However, in preparing the camera-ready version of the paper, I recommend that the authors make the following (minor) changes:
- Fig. 3 is difficult to read. Consider showing only an excerpt of the data node graph (the main concepts) and describing the rest in the text.
- The figures reporting the user study results could be replaced by tables (especially Figure 8, which is very long). Please also report standard deviation values wherever possible.