Abstract:
Continuously finding the most relevant (in short, top-k) answers to a query that joins streaming and distributed data is receiving
growing attention. In recent years, this has been happening in particular in Social Media and IoT. It is well known that, in those
settings, remaining reactive can be challenging, because accessing the distributed data can be highly time-consuming as well
as rate-limited. In this paper, we investigate the problem of continuous top-k query evaluation over a data stream joined with
a distributed dataset in an even more extreme situation: the distributed data evolves. We propose the Topk+N algorithm and the
AcquaTop framework. They keep a local replica of the distributed dataset up to date and guarantee reactiveness by construction,
but to do so they may need to approximate the result. Therefore, we propose two maintenance policies to update the replica: the
Top Selection Maintenance (AT-TSM) policy maximizes the relevancy, while the Border Selection Maintenance (AT-BSM) policy
maximizes the accuracy of the top-k result. We contribute a theoretical proof of the correctness of the Topk+N algorithm and we
study its complexity. Moreover, we provide empirical evidence that the proposed policies within the AcquaTop framework produce
more relevant and accurate results than the state of the art.