A dynamic HDT variant

Tracking #: 3244-4458

Authors: 
Eirik Kultorp
Wouter Beek
Stefan Schlobach

Responsible editor: 
Guest Editors Tools Systems 2022

Submission type: 
Tool/System Report
Abstract: 
HDT is a compact yet queryable storage format for RDF data. HDT scales to very large datasets and is used in production environments. While it is possible to merge two HDT files, it is not possible to perform removals. This prevents the use of HDT for some use cases which require dynamism. In this paper we propose and evaluate a dynamic HDT variant. Our results show that we achieve performant update operations without a prohibitively high cost for other operations, making our implementation a viable alternative to the existing HDT implementation if dynamism is required.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Miel Vander Sande submitted on 27/Oct/2022
Suggestion:
Major Revision
Review Comment:

This paper presents an extension to the binary RDF format Header-Dictionary-Triples (HDT) to allow update operations. The authors therefore present a couple of alterations to the internal structure, namely
- a customised trie-based index for the dictionary, instead of the compacted static data structure (a generic sketch follows below)
- an uncompressed version of the existing BitMapTriples data structure
- new functions for inserting triples and subtracting HDTs, in addition to the existing creation, merging, export, search and retrieval functionality
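For readers less familiar with the underlying data structure, a minimal, generic sketch of a trie-based dictionary mapping RDF terms to integer IDs might look as follows. This is purely illustrative and is not the authors' actual design, which additionally has to support deletions, counters, and a compact byte-level layout.

```python
# Generic trie-based string dictionary: maps RDF terms to integer IDs.
# Illustrative only; the paper's dictionary differs in layout and features.
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.term_id = None  # set if a term ends at this node


class TrieDictionary:
    def __init__(self):
        self.root = TrieNode()
        self.next_id = 1

    def insert(self, term: str) -> int:
        """Add a term (if new) and return its ID."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        if node.term_id is None:
            node.term_id = self.next_id
            self.next_id += 1
        return node.term_id

    def lookup(self, term: str):
        """Return the ID of a term, or None if it is not present."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.term_id
```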

This HDT variant is compared to the original HDT design in order to validate the 4 research questions about runtime and memory usage. The authors conclude that the dynamic variant degrades all HDT aspects, but is nonetheless a feasible alternative.

The contents of the paper are interesting and offer some practical value. However, there are some significant flaws that need to be addressed before this paper can be accepted:

1. The text is okay as written but not well structured, which makes it very hard to read. Section 3.1.2 in particular is problematic. Overall, the text could use more subdivisions and better explanatory figures. Fig. 1 and Fig. 2 don't really help much (the evaluation is ok).

2. The original HDT format has three features that form its USP:
- it is a really compact archive format (like a zip)
- you can still perform very fast lookups
- once created, it is memory efficient.
From what I gather from the design and evaluation, the dynamic variant is none of these things, which raises the question of whether a comparison to (just) HDT still makes sense. It seems that lightweight triple stores such as RDF4J, Jena, or Oxigraph should also be included, since they share more goals in common with this dynamic variant.

3. The research questions seem cherry-picked instead of painting an objective, general picture. They are quite vague, meaning you can satisfy them one way or another. Concepts like “order of magnitude” and “best workaround” should be more concrete, so you know what the real success criteria of the approach are.

4. The differences in the trade-offs made (compactness vs. dynamicity) also skew the experimental results. For instance, in 4.3, if you sacrifice the shared index for a simpler, less compact one, it's quite obvious it's going to lower compactness and increase the ease of merging. There's also no real compression requirement anymore, which gives a lot of advantage over the static HDT. For 4.1, I would also argue that HDT creation could be implemented in a less memory-intensive way (e.g., by using a key-value store as a temporary dictionary instead of memory, or, more recently, by using a cascade of hdtMerge operations), which makes the decrease in memory usage not that meaningful. It would be if this was directly caused by the new data structures, but it isn't really.
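To make the cascading idea concrete, such a memory-bounded build could be organised as a divide-and-conquer pass over chunks of the input. The sketch below is only an illustration under that assumption; build_hdt and merge_hdt are hypothetical stand-ins for whatever serialization and merge tools are actually used (e.g. an rdf2hdt-style builder and an hdtCat/hdtMerge-style merge).

```python
# Hypothetical sketch of memory-bounded HDT creation via cascaded merges:
# build a small HDT per input chunk, then merge the results pairwise.
# build_hdt and merge_hdt are placeholders, not real library functions.

def build_hdt(ntriples_chunk_path: str) -> str:
    """Serialize one bounded chunk of N-Triples into a small HDT file."""
    raise NotImplementedError  # stand-in for an rdf2hdt-style builder

def merge_hdt(hdt_a: str, hdt_b: str) -> str:
    """Merge two HDT files into one (hdtCat/hdtMerge-style)."""
    raise NotImplementedError  # stand-in for a merge tool

def cascade_build(chunk_paths: list[str]) -> str:
    # First pass: one small HDT per chunk, so peak memory is bounded by the
    # chunk size rather than by the full dataset.
    hdts = [build_hdt(p) for p in chunk_paths]
    # Then merge pairwise until a single HDT remains (log2(n) rounds).
    while len(hdts) > 1:
        merged = [merge_hdt(hdts[i], hdts[i + 1])
                  for i in range(0, len(hdts) - 1, 2)]
        if len(hdts) % 2 == 1:
            merged.append(hdts[-1])
        hdts = merged
    return hdts[0]
```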

5. The choices made in the trie design are not very well motivated. From 3.1.1: Why is marking nodes as deleted not acceptable? What's wrong with a 255-byte limitation? Why is the lexicographic correspondence so important?

Review #2
By José M. Giménez-García submitted on 28/Oct/2022
Suggestion:
Reject
Review Comment:

This paper proposes a dynamic variant of HDT that supports update operations. The changes needed to support these operations have an impact on existing operations of HDT, but the authors argue that it is not “prohibitively high.” The dynamic variant uses tries for the implementation of the Dictionary component of HDT, and removes the Shared section (they argue having it is not feasible). The Triples component uses 4-tuples without compression.

I think the paper addresses the important challenge of updating the content of HDT files, and proposes an approach to do so. However, I have several problems with the submission; mainly with the experimental results and the availability of the tool.

With respect to the availability of the tool, it is not open source and it seems it is not even available (it is only possible to find a Docker image in the resources, but there is no indication of how to use it, and nothing in the paper points to the tool). This is a big negative point for a tool/system report.

With respect to the experimental results, the authors evaluate runtime and peak memory usage against HDT for existing operations (1, 2, and 3 down below), and against “the best workaround” for new operations (4 and 5 down below), as well as the size of the HDT files. The variant shows the following gains or losses w.r.t. the existing implementation of HDT:
(1) Serializing RDF in plain text into HDT: Increase in runtime between 7% and 89%. Change in peak memory usage ranging from a 5% increase to a 34% decrease. Increase in size of the HDT files between 81% and 900%.
(2) Extracting HDT into RDF in plain text: Variation in runtime from -2% to +44%. Increase in peak memory usage between 71% and 473%.
(3) Merging two HDT files: Decrease in runtime between 32% and 43%. Increase in peak memory usage between 81% and 213%.
(4) Subtracting an HDT file from another: Decrease in runtime between 32% and 43%. Increase in peak memory usage between 81% and 213% (the evaluation is made against having the two files in RDF in plain text, using the comm command to generate the difference, and serializing to HDT; a sketch of this workaround follows after this list).
(5) Updating an HDT file with a set of additions and a set of subtractions: The results vary depending on the proportion of triples that are modified (the lower the better), with decreases in runtime between 58% and 99% and decreases in peak memory usage between 32% and 64% (the evaluation is made against the serialization into HDT of the plain-text RDF file that contains the end result of the modifications).
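For clarity, the comm-based baseline referred to in (4) amounts to roughly the following pipeline. This is only a reconstruction for illustration; the export and re-serialization steps are placeholders, and the exact commands used in the paper may differ.

```python
# Rough sketch of the comm-based subtraction baseline (illustrative only):
# export both HDT files to sorted plain-text triples, keep the triples that
# occur only in the first file, and re-serialize the result to HDT.
import subprocess

def export_to_sorted_ntriples(hdt_path: str, nt_path: str) -> None:
    raise NotImplementedError  # placeholder for an hdt2rdf-style export + sort

def serialize_to_hdt(nt_path: str, hdt_path: str) -> None:
    raise NotImplementedError  # placeholder for an rdf2hdt-style serialization

def subtract(hdt_a: str, hdt_b: str, hdt_out: str) -> None:
    export_to_sorted_ntriples(hdt_a, "a.nt")
    export_to_sorted_ntriples(hdt_b, "b.nt")
    # comm -23 prints the lines that appear only in the first (sorted) input.
    with open("diff.nt", "w") as out:
        subprocess.run(["comm", "-23", "a.nt", "b.nt"], stdout=out, check=True)
    serialize_to_hdt("diff.nt", hdt_out)
```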

The results are not convincing. The dynamic variant shows a slight improvement in runtime when serializing or merging files, and a significant improvement in both runtime and memory usage when updating HDT files. But that comes at the cost of increasing the size of the files by up to 900% and requiring two to four times as much memory for extracting the content of an HDT file or merging two HDT files.

In addition, the experimental evaluation has the following problems:
-The paper says nothing about the hardware where the experiments have been performed.
-The workaround for experiments 4 and 5 seems a bit arbitrary. I wonder if other approaches would yield better results. My intuition is that piping the triples from HDT files instead of reading from plain text would improve the runtime, for example.
-There are no experiments about triple pattern queries. This is in fact the most important operation of HDT, and I think this should be mandatory in any evaluation against the current implementation of HDT.
-The datasets used are skewed towards having a big proportion of literals. I feel that this skews the results in favor of the proposed dynamic variant, for two reasons: IRIs are more compressible using plain front coding; and literals only appear as objects, so removing the shared section in the dictionary has less impact.
-The experiments lack some details. For example, I assume that the update in the dynamic variant includes the times of reading the HDT, making the modifications, and writing back to disk, but this should be stated clearly.
-In the last experiment, the fact that triples are extracted from every N lines of the HDT file can introduce bias (since the data is sorted and the triples are taken in a regular way). Randomizing the triples before taking a sample would be recommended (see the sketch after this list).
-In the last experiment, the authors argue "that this systematic extraction is a worst-case scenario for our dynamic variant, as it means the data we will need to access and update is evenly distributed throughout the data structure." They don't explain why that's the case. I think that the fact that the data is sorted has a much bigger beneficial impact than the negative impact of this.
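A minimal way to avoid the sampling bias mentioned above is to draw a uniform random sample of triples instead of every N-th line; a small illustrative sketch (not the authors' code):

```python
# Systematic vs. random sampling of triples (illustrative only).
import random

def systematic_sample(triples, n):
    # Current approach: every n-th triple of the sorted data, which is
    # correlated with the sort order and can introduce bias.
    return triples[::n]

def random_sample(triples, k, seed=42):
    # Recommended: a uniform random sample of k triples, independent of order.
    rng = random.Random(seed)
    return rng.sample(triples, k)
```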

In addition, there are some other issues worth commenting:
-There is no convincing evidence about the importance and impact of the tool. And while the importance seems high, the possible impact seems low, given that it is not open source and it seems to be developed with the sole purpose of being used inside Triply, the company that owns the tool.
-The resources provided contain the scripts to get the data and a Docker container to replicate the results, but no instructions on how to run them or on the requirements that are needed.
-The related work includes references to very basic work that I'm not sure is relevant, while excluding works in the area of compact data structures (such as suffix trees, Patricia trees, or packed compact tries), which are the most relevant for HDT.
-The related work is also missing other works that use tries to index RDF, such as [1].
-The decision to include counters seems arbitrary, ad hoc for the needs of the authors, and counterintuitive: An RDF dataset is a set of triples. It follows that if you merge two sets and remove an item afterwards, the resulting set should not contain the item, even if it existed in both of the first two sets (a small illustration follows after this list). I think that for this tool to be useful, this feature would need to be optional, since the general case would be not to use it.
-I think that including figures of the sequences that encode the tries (both the existing and the new tries) would be very useful for understanding the design.
-The structure and writing of the paper are a bit confusing sometimes. For example:
-The related work is in the section about the design.
-In the design, it is not clear where they describe how a generic trie would be implemented and where their own modifications begin.
-The terminology is not used appropriately (the authors use the word “section” for both the HDT components and the Dictionary sections).
-There is no mention of the language of the implementation until the very last sentence of the paper.
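To make the set-semantics point above concrete, a small illustration (not the authors' code) of how set semantics and counter (multiset) semantics diverge after a merge followed by a removal:

```python
# Set semantics vs. counter (multiset) semantics for merge followed by removal.
from collections import Counter

a = {("s", "p", "o")}  # the triple is present in the first dataset
b = {("s", "p", "o")}  # ...and also in the second dataset

# RDF set semantics: after merging and then removing the triple, it is gone.
merged_set = a | b
merged_set.discard(("s", "p", "o"))
print(("s", "p", "o") in merged_set)        # False

# Counter semantics: one removal only decrements the count from 2 to 1,
# so the triple is still reported as present.
merged_counter = Counter(a) + Counter(b)
merged_counter.subtract([("s", "p", "o")])
print(merged_counter[("s", "p", "o")] > 0)  # True
```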

All in all, I think the authors tackle an important challenge in the state of the art of HDT, and the idea behind their work is worth exploring. But given the problems in the experimental results, the fact that the tool is not open source and its availability is unclear, and the current state of the paper, I have to recommend the rejection of the paper.

[1] Pelgrin, O.P., Hose, K., Galárraga, L., 2021. TrieDF: Efficient in-memory indexing for metadata-augmented RDF, in: Proceedings of the 7th Workshop on Managing the Evolution and Preservation of the Data Web. Presented at the MEPDaW, CEUR Workshop Proceedings.