Review Comment:
The paper proposes a framework for continuous top-k query answering over streaming and evolving distributed data. The research problem being tackled is a challenging one, since it considers both the velocity of the streaming data and the evolution of the otherwise static, distributed data.
The framework extends the previous approaches by Yang et al. [1] and Dehghanzadeh et al. [2]. The new additions are: the Super-MTK+N list as a data structure, the MinTopk+N algorithm, the AcquaTop algorithm, and two replica maintenance policies (i.e., MTKN-F and MTKN-T).
The paper provides experimental evaluations that investigate different maintenance policies for stream queries over Twitter data. The experiments are reported to confirm that the MTKN-F policy maximizes the coverage accuracy of the top-k result, while MTKN-T maximizes the relevancy.
-----
I believe that the paper in its current form cannot be accepted without a major revision. The main points of my review are as follows:
# More convincing motivation
I am not yet fully convinced that the ability to handle slowly evolving data brings significant benefits over the state of the art. How often does data evolve slowly, and why slowly rather than quickly? What are potential real-world applications of the proposed approach for handling slowly evolving data? What are concrete real-world problems that are severely limited without the ability to handle slowly evolving data? Also, a more elaborate distinction from [1,2] is necessary.
# Experimental evaluations
I believe that measuring response time is a crucial aspect of the experimental evaluation, in particular since being reactive plays a key role in processing streaming data. Yet, the experimental evaluations only measure result relevancy. Moreover, measuring the overhead of the proposed approach in the case of non-evolving data would provide valuable insight into when to adopt the proposed approach. A comparative experimental evaluation against [1,2] would be a valuable addition, too. Also, a stress-test case in the experiments could show to what extent the proposed approach still works reasonably.
# Quality of writing
There are many typos (see detailed remarks). Some sentences are too long and too complex. Some parts are too dense, for example Sec. 4.3.2. Some parts lack the examples needed to grasp the main ideas (see detailed remarks).
-----
Below are my detailed remarks regarding the presented work:
# There is no explicit definition of top-k queries, as: A top-k query is defined as ...
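For illustration, a definition along the following lines would help (this is my own sketch, not taken from the paper):

```latex
% Possible formal definition (reviewer's sketch): given a set of objects O,
% a scoring function F, and an integer k, a top-k query returns a set A with
\mathrm{TopK}(O, F, k) = A \subseteq O
  \text{ s.t. } |A| = k
  \text{ and } \forall a \in A,\ \forall o \in O \setminus A :\ F(a) \geq F(o)
```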
# Can "reactive" be further elaborated? Is it equivalent to fast query processing? How does it differ from fast query processing over static data? Any concrete examples of query executions that are reactive vs. not reactive?
# What are your assumptions regarding the poor network conditions for AcquaTop? How poor is poor? There must be some lower bounds of the latency, bandwidth, and rate for your approach to still work reasonably.
# Related work: The problem of finding the most relevant answers is very similar to that of retrieving the most relevant documents in the information retrieval domain. There, the data is vastly distributed and quickly evolving. Can you relate your approach to theirs?
# How much overhead is introduced by your approach when the distributed data does not change, particularly compared to [1,2]?
# The example in Listing 1 is a bit problematic. It assumes that mention counts are immediately available in the stream. Yet, in reality, the mention counts are themselves the results of other aggregations, which are also challenging to compute due to the size and rapid evolution of social networks. I'd say that the queries computing the mention counts also deserve some attention regarding their implementation.
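For concreteness, a minimal sketch of the kind of aggregation that Listing 1 implicitly assumes (hypothetical Python; the `mentions` field name is my own, not the paper's):

```python
from collections import Counter

# Hypothetical sketch: the per-user mention counts that Listing 1 assumes
# to be readily available in the stream are themselves the output of a
# windowed aggregation over the raw tweets, along these lines.
def mention_counts(window_of_tweets):
    counts = Counter()
    for tweet in window_of_tweets:
        for user in tweet["mentions"]:
            counts[user] += 1
    return counts

window = [{"mentions": ["alice", "bob"]}, {"mentions": ["alice"]}]
print(mention_counts(window))  # Counter({'alice': 2, 'bob': 1})
```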
# Pg. 2: "and they would recompute the result from scratch risking to lose reactiveness" -> Can you provide experimental evaluations that show some query execution time comparison of your approach to the baseline approach where results have to be recomputed from scratch?
# Grammar: Is "data" singular or plural? Whichever you choose, be consistent in its usage.
# As the paper directly extends [1,2], I'd also expect some performance comparison to the previous work. In the case where the previous work is not compatible with the new experimental settings (say, because the distributed data now evolves slowly), you might want to add reasonable naive/baseline modifications to the previous work in order to make it work.
# The naming of the maintenance policies is somewhat misleading. MTKN-F is said to maximize accuracy. However, it tries to retrieve all top-k answers while ignoring their ordering. If the ordering is ignored, the result is not really accurate, is it? I believe "coverage" is a more suitable term.
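To illustrate the distinction, a toy sketch (both metrics below are my own simplifications, not the paper's definitions):

```python
# Toy illustration (reviewer's own simplified metrics): "coverage"
# ignores ordering, while an order-sensitive measure does not.
def coverage(approx, exact):
    # fraction of the exact top-k objects present in the approximate result
    return len(set(approx) & set(exact)) / len(exact)

def ordered_match(approx, exact):
    # fraction of rank positions on which the two lists agree
    return sum(a == e for a, e in zip(approx, exact)) / len(exact)

exact = ["E", "C", "B"]   # true top-3, best first
approx = ["B", "C", "E"]  # same objects, reversed order
print(coverage(approx, exact))       # 1.0: full coverage
print(ordered_match(approx, exact))  # only 1 of 3 positions matches
```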
# The sectioning can be further improved by breaking down Sec. 4 into two sections.
# What happens if the distributed data evolves quickly? What are the limitations of your approach? Moreover, what is the main motivation for considering slowly evolving data, rather than evolving data in general? In practice, how common or rare is slowly evolving data?
# Sec. 2.1.1 and 2.1.2 lack examples needed to help understand the formalization.
# In order to be self-contained, at the end of Sec. 2.1.1, briefly explain QF.
# The metric DCG@K is given without sufficient intuition or description. For example, what is rel_i?
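For reference, the standard definition (which I assume the paper intends) is:

```latex
% Standard form of discounted cumulative gain at cut-off K, where rel_i is
% the graded relevance of the result at rank i:
\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)}
```

Note that a common variant uses $2^{rel_i} - 1$ in the numerator; the paper should state which form it uses and define $rel_i$ explicitly.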
# Sec 2.2: I believe distributed datasets haven't been introduced, yet they are mentioned here.
# The 1:1 join relationship seems like quite a restriction. In practice, how common or rare are 1:1 joins?
# F(XS, XD) is monotone. However, the values of XS and XD themselves may increase and decrease over time. This should be discussed to clarify the matter.
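A minimal sketch of the issue (the scoring function below is hypothetical, chosen only to be monotone in each argument):

```python
# Hypothetical scoring function, monotone (non-decreasing) in each argument.
def F(xs, xd):
    return 0.7 * xs + 0.3 * xd

# Monotonicity of F does not prevent an object's score from dropping over
# time, because the inputs themselves may decrease between evaluations.
old_score = F(10, 5)
new_score = F(8, 5)  # X_S decreased between two windows
print(old_score > new_score)  # True: the score fell although F is monotone
```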
# Is Fig. 2.a just an example, or is it actually the generic form of SE? If it is the generic form, then I suppose having multiple windows and multiple services is not allowed?
# At the end of Sec. 2, there is the idea of computing error by comparing Ans(RQi) and Ans(Qi). Can examples be provided?
# Pg. 7: "For instance, objects E, C, and B are in the top-3 result of window W_0" -> There is no W_0 in Fig. 3, and there is also no object B?
# Fig. 3 is a bit too small. Moreover, the object scoring is not shown explicitly in the figure.
# Pg. 7: It appears to me that the MTK set relies on the assumption that previous top-k results are very likely to appear in future top-k results. Since your work is based on the MTK set, how robust is this assumption, both from a theoretical and a practical point of view? What are its main limitations? Generally, I think all objects have the potential to appear in future top-k results.
# Sec. 3.2: The current writing is still not really accessible. Perhaps an example of how replicas are updated might improve the readability.
# Reactiveness is indeed an important measurement for querying over data streams. Yet, the experimental evaluations are missing such a measurement.
# Pg. 9: Why is it sufficient to keep only the minimum value of the scoring variable X_S?
# Theorem 3 seems to cover the hardest case of the three. How often does this case occur, in the experiments and in practice?
# Pg. 10: What exactly is the difference between the Super-MTK+N list and the MTK+N list? Can there be illustrations?
# Fig. 5 can be made more detailed, especially on the components of Topk+N.
# Algorithm 1: How does one get changed objects from the distributed datasets? How can one detect changes? How easy is it to detect changes?
# Sec. 4.3.2: "... thus the LBP set is recomputed from scratch" -> Is the set usually small?
# Sec. 4.3.2 can be made more accessible by first having a motivation and intuition of the proposed algorithm, before presenting the details.
# Having presented the algorithms, perhaps discuss what is new compared to [1].
# Perhaps when discussing Alg. 4, also highlight the similarities and differences to Alg. 1.
# In the experimental setup, why are ties not preferred? What happens if there are ties? Would ties affect the performance?
# As a stress test, what happens when there are many changes per invocation, i.e., more than 80?
# Sec. 5.4: The figure seems to show an elbow point, i.e., a point after which the performance rises only slowly as the refresh budget increases. Can this elbow point be elaborated on?
# Sec. 5.4: What are reasonable refresh budgets? Which parameters influence how large the budget should be set?
# Sec. 5.6: Fig. 10.b and 10.d show an interesting behavior: the performance is the worst when K is in the middle. Any argument why?
# Sec. 5 lacks a closing discussion that elaborates and relates observations from the four experiments. Perhaps some parts of the conclusion can be made as a closing discussion.
# The experiments were done over Twitter. What are the assumptions on the characteristics of Twitter's data stream? Are these assumptions reasonable? Can they be generalized so that the experiments also provide insights for other kinds of data streams?
# There are many typos, some of which are:
- Pg. 2: loosing -> losing
- Pg. 2: guaranteing -> guaranteeing
- Pg. 3: we presents -> we present
- Pg. 6: to proposed
- Pg. 9: we can keeping
- Pg. 17: it compute
- Pg. 19: Abising notatoion
- etc