Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
HDT (Header, Dictionary, Triples) is a well known compact data format for RDF data. The main idea of HDT is to represent the labels of an RDF graph by unique integers, and to create a dictionary that maps these integers to their corresponding labels. Since labels can be long and can occur many times in one RDF graph, this simple dictionary mapping can achieve reasonable compression performance. In the present paper, the authors extend HDT by encryption. The idea is that different user groups have access to only certain subgraphs of an RDF graph. These subgraphs are encrypted (using standard public/private-key encryption) so that only those users which have the right to access a certain subgraph are able to decrypt that subgraph.
I find this research important and practically relevant. The paper is well written. Especially the first 10 pages that explain the framework are exception- ally well written and are a true joy to read. My concern with the present version of this paper is the experimental section. First, it is unclear how realistic the two datasets and their subgraphs are; I find it awkward to randomly select sub- graphs. The paper would greatly benefit from a real dataset with real access restrictions as they would actually be used in an industrial application. Second, the experimental section is way too detailed and overly verbose; an essential task has not been carried out yet, namely, to select and properly summarize the important data only and to draw (compact) relevant conclusions. The cur- rent version feels like a “full dump” of all the produced experimental data. My recommendation is to accept the paper for publication in The Semantic Web Journal, subject to a major revision of the experimental section.
Main Comments
----------------
(1) Is the “Efficient Compression” in the title of the paper justified? As far as I understand, there is nothing new in this paper concerning efficient compression. Also, I am bit worried by the actual compression and decompression times. For certain graphs Table 4 contains compression times of over 3 hours. That seems quite long, if the only task of the compressor is building a dictionary. One would expect running times that are similar to those of general purpose dictionary compressors (e.g., gzip).
(2) About the Experimental section: Why is it realistic to experiment with 6, 9, and 12 subgraphs only, and to have only 4 different LUB sizes that increase linearly. Why not scale up to 100 or 1000 subgraphs, and have LUB sizes at 1K, 2K, 4K, 8K, and 16K to obtain a more complete picture?
As mentioned before, I do not find it realistic to choose subgraphs randomly. I find it important to experiment with realistic subgraph selections as they would appear in an actual application.
I think many of the tables and graphs should be removed, and merely the summary of the (essential) observations should be given. There is simply too much data here. The graphs about querying (Figures 10–13) do not look very interesting, please remove them.
Table 2: can you explain why the sizes can increase, when going from 9 to 12 subgraphs (e.g., for G4)?
Table 3: I think this table can be removed, the data looks rather similar and predictable. It can be summarized in a few (clear) sentences.
Table 4: please remove.
Table 5: please explain these times in the text and remove this table.
Typos
------
page 1, right column, line 2: change “are allowed access to” into “are allowed to access”.
page 1, right column, line -9: spell out what “HDT” stand for.
page 2, left column, line 6: wrong opening quote mark.
page 2, left column, line 9: the reader can at this moment not understand where the “redundancy” comes from. Please improve.
page 2, left column, line -3: “minimise” is a strong word. I would be careful to use it.
page 2, right column, line 4–9: does this represent a good state-of-the art approach?
page 2, right column, line -12: one wonders how the final binary encoding works, e.g., whether a Huffman Code or similar is applied?
page 2, right column, bottom: it might look nicer to have k2 (i.e., the k in italic). How much space saving do the approaches [40], [34], and [28] give in comparison? As far as I understand, [34] gives the strongest compression (stronger than k2), is that correct? On the other hand, on page 3, left column, 10 lines above Section 3, it is written that “k2 triples ...is the most effective compressor”. Does “most effective” refer to “space saving”?
page 3, right column, line 4 of Section 3.1: check the typesetting of SO. (should be SO).
page 4, left column, line 5: it might be more intuitive to simply “apply” the dictionary as a function, and to write (D(s), D(p), D(o)). In Figure 2, can you please explain how the non-binary numbers, such as 3, 4, 6, 2, 7 etc in So are represented in binary? Do you use a fixed-length encoding for these? How does the “alignment” (line 3 above Section 3.3) work? Also the reader wonders at this moment, how blank nodes are represented (later you say something about it). page 4, left column, line -5: change “triples” into “triple”. Two lines below, change “dictionary dictionary” into “dictionary”.
page 5, right column: “It is worth noting” used twice in this paragraph. Maybe rephrase at least one of them.
page 6, line 2 of Section 4.2: please spell out AES.
page 7, line 4 of Section 5: remove “will”. Line -5: what does “no standard- izing apart” mean? By the way, this is US American spelling, while for other words (“minimising”) you use British English spelling. Please decide which one you want, and the make a consistent version.
page 8, left column, line 3 above Section 5.3: this superscript in HDT′ is not well readable.
page 9, left column: I am not sure this observation is important, but, count- ing the number of ones in a binary string is known as “rank”; there are many results and data structures for efficiently determining the rank of a position in a bit string. How is it implemented in your system?
page 9, right column: n:m-relation, better write n:m.
page 12, right column: please draw a clear conclusion (instead of reading out the numbers).
pages 13 and 14: if encryption and decryption times are so negligible, then why even mention and show them?
page 14, line 3 of Section 6.3: it is unclear to me what “fair” means here. page 15, left column, line 2 of Section 6.4.1. Insert “the” before “i.e.”.
page 15, right column: the first paragraph is as expected, and also the
bottom paragraph can be removed, I think.
page 16, it is not clear to me whether random queries give realistic picture
as to how the data structures would behave in a real setting. I think the paper tried to do too much here, and looses focus. I think the paper would benefit if the querying part was reduced to a few short paragraphs that summarize the outcomes. The presentation of the various figures is not justified.
pages 20,21: some of these papers have been published in the meantime, please check.
- End of comments.
|