A sustainable open data platform for air quality data

Tracking #: 2424-3638

Raf Buyle
Brecht Van de Vyvere
Dwight Van Lancker
Eveline Vlassenroot
Mathias Van Compernolle
Stefaan Lefever
Pieter Colpaert
Peter Mechant
Erik Mannens

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
Smart Cities need (sensor) data for better decision-making. However, while there are vast amounts of data available about and from cities, an intermediary is needed that connects and interprets sensor data on Web-scale. Today, governments are struggling to publish open data in a sustainable, predictable and cost-effective way. Our research question considers what methods for publishing and archiving linked open data time series, in specific air quality data, are suitable in a sustainable and cost-effective way. Based on a scenario, co-created with various Flemish governmental stakeholders, we benchmarked two Internet of Things reference architectures (FiWare Quantum Leap API and Linked Time Series Server) and calculated the cost for both data publisher and consumer. Results show that applying Linked Times Series on public endpoints for air quality data will lower the cost of publishing and will raise availability because of a better Web caching strategy.


Solicited Reviews:
Review #1
Anonymous submitted on 13/Apr/2020
Major Revision
Review Comment:

The paper addresses the problem of publishing Air Quality data sustainably. The paper is well written, and it addresses an important problem. It provides an assessment of the performance of two alternative publication interfaces, i.e., one based on the QuantumLeap API and one based on Triple Pattern Fragments.

Although I value the work very much, I believe it still requires improvement. I found minor and major issues that impact my overall evaluation.

In the following, I first list the minor shortcomings.

- Cost versus resources: the authors initially discuss the intent of cost reduction for publishing air quality data. However, they only describe the advantages of publishing (which is very good), but they do not quantify the current costs. Additionally, when it comes to evaluating a solution, they measure resource consumption. Although it is undeniable that a relation between resources and cost exists, this relation is left implicit. This is not enough for the following reasons:
(i) the authors adopted Kubernetes to deploy their solution, so it is possible at least to quantify the cost of deployment onto different cloud providers;
(ii) the average load that the platform usually has is unknown, so we do not know if the improvement is sufficient to sustain a decrease in the fixed costs;
(iii) the authors forget the costs of installing a new solution and the consequent updates on the client side.
For these reasons, I suggest reformulating the motivation in terms of a performance assessment rather than discussing a very complex (and sometimes subjective) matter like costs.

- RDF Streams: the authors make very clear that they do not focus on real-time. Nevertheless, I find that research efforts like [1,2,3,4] should at least be mentioned as closely related, especially those that attempt to extend Triple Pattern Fragments for stream processing.

- Lack of examples and overly complex structure. The paper is well written. However, some passages are unnecessarily complex and would benefit from the introduction of an example.

In Section 2.3, it is a bit hard to distinguish what the proposed solution is. I understood that the authors wanted to highlight the benefit of the selected technology w.r.t. standard linked data. My suggestion is to clarify the subject of each paragraph.

Section 3 is unnecessarily complex in its structure and lacks an example that can guide the reader to comprehend the problem. The presented use-cases follow a very common structure: (1) interval definition, (2) request, (3) response, and (4) error (if possible). The main difference is the nature of the request and the response: Fragment (+ iterator) versus Snippet (+ iterator/neighbors). I suggest generalizing the structure of the use case and adding an example. Additionally, it would be beneficial to align the use-cases with the scenarios used in the evaluation. Figure 3 goes in the direction I suggest but requires improvement. The data models and various formulas are not explained, and I suspect an error in U2 (ls1 is everywhere).

Finally, the paper presents a significant issue in the evaluation. As this is the most important contribution of the work, it is of paramount importance to improve this section.

The KPI selection is compelling; however, what is missing is its relation to costs. This task is left to the reader without providing the necessary information, e.g., the average workload that this platform should have.

I did not like how latency is presented. The selected representation makes it hard to convey the presented information. I suggest substituting the plots with histograms presenting the average latency per request with respect to the number of concurrent clients, plus error bars or max/min values. This would easily convey the scalability and would make possible a direct comparison of the two approaches (a single figure that compares the histograms).
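As a minimal sketch of the suggested presentation, assuming the raw measurements are available as (number of concurrent clients, latency) pairs, the mean, minimum and maximum latency per client count could be derived as follows. The function name and the sample values are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def summarize_latency(samples):
    """Group (clients, latency_ms) pairs by client count.

    Returns a dict mapping each client count to (mean, min, max) latency.
    """
    grouped = defaultdict(list)
    for clients, latency_ms in samples:
        grouped[clients].append(latency_ms)
    return {
        clients: (mean(values), min(values), max(values))
        for clients, values in sorted(grouped.items())
    }


# Hypothetical measurements, for illustration only.
samples = [(100, 20.0), (100, 30.0), (400, 80.0), (400, 120.0)]
summary = summarize_latency(samples)

for clients, (avg, low, high) in summary.items():
    print(f"{clients} clients: mean={avg:.1f} ms, min={low:.1f} ms, max={high:.1f} ms")
```

Computing one such summary per API would give the bar heights and error ranges for a single comparative figure.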

A note on adoption. It is unclear to me what the clients do. Would it be possible to describe their workload in a more precise manner (e.g., SPARQL query)?

Moreover, some figures are hard to read (e.g., Figure 16), and the scale-up from 100 to 400 concurrent clients leaves doubts about the behavior at intermediate numbers of clients.

Last but not least, the overall evaluation section lacks a diagnostic look at the performance results. Ideally, I would like the authors to enrich the measurements with knowledge about the internals of the solutions. It is vital to raise the level of the analysis for a paper that has a performance comparison as a key contribution.

In the final section the authors wrote:
"The benchmark showed that the Linked Time Series approach lowers the cost for publishing air-quality data and raises availability because of a better caching [...]"

Although I agree that "The results of the benchmark ascertain that
— with an increasing number of clients — the LTS
API has a lower CPU load than the Fiware Quantumleap API."

The former claim is, IMHO, not backed up by sufficient evidence.
Therefore, I suggest a major revision.

[1] Mauri, Andrea, et al. "TripleWave: Spreading RDF Streams on the Web." International Semantic Web Conference. Springer, Cham, 2016.

[2] Tommasini, Riccardo, et al. "VoCaLS: Vocabulary and Catalog of Linked Streams." International Semantic Web Conference. Springer, Cham, 2018.

[3] Keskisärkkä, Robin, and Eva Blomqvist. "Event Object Boundaries in RDF Streams: A Position Paper." ISWC 2013 Workshop: Proceedings of OrdRing. 2013.

[4] Taelman, Ruben, et al. "On the Semantics of TPF-QS towards Publishing and Querying RDF Streams at Web-scale." Procedia Computer Science 137 (2018): 43-54.

Review #2
By Aidan Hogan submitted on 20/Apr/2020
Review Comment:

The paper describes a platform for managing and querying data relating to air quality using Linked Data Fragments. In particular, the paper explores a use-case where different sensors of air-quality are producing time series of data, where it is assumed that a high number of clients wish to apply aggregations over the raw data (in terms of taking slices of readings within a time period, averaging sensor readings in a particular geographic bounding box, etc.). After a comprehensive introduction and background on the standards, infrastructure and policies regarding the management of such data -- as well as the technical proposals on how it can be solved -- the paper describes in detail the primary use-case, which involves roaming sensors of air-quality. The paper then introduces two alternative technical solutions for supporting the aforementioned client queries/aggregations over these data: using Linked Time Series (based on Linked Data Fragments), and using FIWARE QuantumLeap (with FIWARE being promoted by the EC). These are benchmarked with respect to four types of queries: the 100 most recent observations; the absolute sensor values for a month (open interval); the absolute sensor values for a month (closed interval); and the average sensor values for an hour (closed interval). Metrics include the memory cost, CPU usage of API/DB, and query latency for a variable number of clients. The results generally show that Linked Time Series outperforms FIWARE QuantumLeap, particularly in scenarios involving a high number of clients.

The paper has been submitted as part of the special issue "Storing, Querying, and Benchmarking the Web of Data".

In terms of positives:

P1: The paper describes a potentially very interesting use-case for Linked Data technologies, involving dynamic data being produced and consumed in a decentralised fashion (though being made available through a centralised API). The scenario implies non-trivial technical challenges with respect to the volume of data, how dynamic they are, as well as the inherently decentralised way in which data are both produced and consumed. Work along these lines -- putting the ideas of the community into practice and evaluating them -- is of key importance.

P2: The paper provides detailed metrics, particularly in terms of supporting very high numbers of clients, which reflects an interesting real-world scenario involving clients who wish to use air-quality measures to perform route planning to avoid contaminated areas. This scenario arises in the context of a collaboration with the Flemish government, and in general reflects promising work.

As a general remark:

R1: Though the paper is submitted as a full research paper, I do not think that this paper meets the criteria for that call because it does not develop any novel techniques per se, but rather applies and evaluates existing techniques in the context of a particular use-case. I don't see that the paper fits any of the submission types targeted by the regular call. I think the paper is closest to a benchmarking paper, which, though not part of the regular call of the Semantic Web Journal, probably fits the special issue. On the other hand, it does not meet some important criteria for a benchmarking paper either.

In terms of weaknesses:

W1: The key technical contribution of the paper appears to me to be the experiments, whose goal is to evaluate the performance of an LDF-based approach versus a FIWARE baseline. While there are certainly commendable aspects about the experiments, a major weakness of the reported results is that they are far from being reproducible for two main reasons. First, no code, no data (generator), no queries, etc., are provided online. Second, and more importantly, the paper provides very little detail on the actual experiments run, and the details that are provided are either too high-level, or difficult to follow. The paper does not provide any examples of concrete data to have an idea of what is being produced or indexed, nor does it provide details of how the timelines are generated, nor how much data is being generated, nor if the data are synthetic, how many sensors are considered, etc. It's also not clear to me how the producers/consumers are distributed on the machines described. Furthermore, though the paper informally describes the queries, it does not provide concrete details. One can infer certain details perhaps, but those details are guesswork. For example, it is not precisely clear what the difference is in practice between queries that are open and closed in their interval (in the open case, is the end a "Now" keyword?). Finally, there's not really all that much detail on how the systems themselves operate, which makes the results more difficult to interpret. Just to remark that the idea of "reproducibility" is not only important in the sense of whether or not someone can physically replicate the results (which rarely happens in practice), but also whether or not the results are "well defined", which is to say if there is not enough detail to (in theory) reproduce the results, then there's probably not enough detail either to give meaning to the results in the paper.

W2: The technical details that are provided I unfortunately found unclear. For example, the most concrete description of the queries that I found was that of Figure 3, but I could not make sense of the details presented. For example, the description of u3 states that it provides the average values within a bounding box, but where is the bounding box in the description of the figure? Why are the values for u3 averaged over all the timestamps if the goal is to produce a time series? Why in u2 are the same values repeated three times? In summary, technical details are sparse, and those that are provided are often unclear.

W3: As a journal paper that appears to focus on an experimental contribution, only comparing the performance of two systems should be recognised as a weakness. The authors argue that the baseline that is chosen is "ubiquitous", and that the FIWARE platform is being pushed by the EC, but I don't understand why QuantumLeap is seen as a "reference architecture" or how it can be established as a state-of-the-art system. From searching online, I don't get the impression that it is "ubiquitous". The general point here is that it is not clear if the benchmark/baseline involves state-of-the-art systems or not. Only comparing two systems seems limited for a paper mainly focused on experimentation.

W4: I think my main concern is that the benchmark tasks appear trivial in my understanding (though admittedly I may have misunderstood because the technical details are unclear). Three of the queries involve an atomic index lookup and enumerating N results. One of the queries involves computing an average over N results. I understand that the number of clients, the frequency of updates, etc., still presents a challenge, but assuming the data are indexed centrally in the DB (which is my understanding), none of these operations can be optimised beyond O(N). They are precisely as expensive as reading the data once, which I understand is needed to send the data to the client anyways. The only possible performance difference I can speculate could cause the results observed in (b4) is the CPU cost for the clients to parse/deserialise the data; computing an average (keeping a running count + sum and dividing the two at the end) is negligible. Also, I did not really understand why nginx is not helping in the case of QuantumLeap; how are the queries actually different between the clients?
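To make the cost argument in W4 concrete, the following is a minimal sketch of the running count + sum average mentioned above. It reads each value exactly once, so it is O(N) and adds only constant work per reading on top of enumerating the data. The function name and the sample readings are hypothetical and do not come from the paper or from either system.

```python
def running_average(values):
    """Average an iterable in a single pass using a running count and sum."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
    if count == 0:
        raise ValueError("cannot average an empty series")
    return total / count


# Hypothetical sensor readings for one hour, for illustration only.
readings = [10.0, 20.0, 30.0, 40.0]
hourly_average = running_average(readings)
print(hourly_average)
```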

Though this type of research is very important for the community, I think the current paper does not meet the requirements for publication at this venue. Put another way, this is promising work, but I think it is too preliminary for a journal paper. As a research paper it does not provide new techniques, but rather evaluates existing techniques in a specific (and important) context. As a benchmark/experimental paper, it has a number of limitations, including a lack of detail and thus reproducibility; a comparison of only two systems; also I am concerned that the benchmark tasks defined perhaps seem trivial (though admittedly I may have misunderstood).

I think with some more work putting these ideas into practice and developing some lessons learnt as a result, this could make for a great "In Use" paper at a conference like ISWC. Perhaps with more work on the details of the benchmark, adding some more systems, publishing the resources online, etc., it could also make for a nice "Resources" track paper at ISWC. I think though that the work required for the paper to be acceptable for this journal goes beyond a Major Revision, and hence my recommendation is for a Reject.

Minor comments:

- In general a lot of the operations described in the benchmark appear similar to OLAP / Data Cubes. Have you considered using such systems on the backend? (Also this looks like a workload for which a good column store might be well-suited.)

- "in [particular] air quality data"

- "in cities [because] of"
- "re-users" Not sure I like this term, unless it has a technical connotation somehow I don't know of?
- "Our research starts with the assumption ..." I think you should probably call this a speculation or a hypothesis.
- "We point[ ]out methods"

Section 2:
- "a time series of sensor measurements ..." I understand this is by Nittel, but this abstract definition makes little sense without context. The surrounding text seems to suggest that m_{sj} is a time series, but my best guess is that it's not, rather it's a reading from a sensor (sj) taken at time t_i and location l_{sj}, with each of v_1, ..., v_n being a value read from each of its n sensors. I think an example would help here to ground the definitions.
- "it is expected [that]"
- "this represents over 126 billion samples" Collected over what period of time?
- "These barriers, referred to as legal IOP" I think legal IOP is the goal? Or that the barriers "obstruct" legal IOP?
- "covers [the] interconnection"
- "Jim Hendler" -> "Hendler"
- Inconsistent capitalisation: FIWARE, Fiware; Quantumleap, QuantumLeap; etc.
- "join[ed] forces"
- "According to Haller" [ref]?
- "actions[. T]hese hypermedia"
- "contrary to regular HTTP servers resources are not exposed" I am not sure this is true. I think the point is just that the requests can be so specific (and there are so many ways to express the same request) that cache hits are likely to be low.
- "leverages [] HTTP"

Section 3:
- "the query [into] subqueries"
- A lot of terms here are used without being introduced: "tracks", "main success scenario", "primary actor", "extension scenarios". Explain these concepts to make the paper more self-contained.
- "According to Vander Sande ..." It seems that some of these factors are dependent, perhaps even redundant? For example, caching appears to be a way to minimise cost and response times (it appears to be a "how", not a "what"; or a "means", not an "ends").
- "are regularly using a native protocol" Unclear. I would suggest to rephrase.
- Some of the results hint at selective presentation of results. For example, in Figure 8, the results go from 100, to 500, to 1000, to 1300. My guess is that perhaps at that number of clients the Linked Time Series Server starts to time out? The authors should define in their experimental settings the number of clients that will be tested and run those consistently across all experiments, up to the minimum number that times out, and report those timeouts. Changing the parameters between each experiment gives the (perhaps false) impression that the paper avoids presenting timeouts for the Linked Time Series server, despite the fact that such results are clearly relevant!
- "AP" -> "API"
- 4.3.4: First paragraph is not justified in alignment.
- The bandwidth figures from 4.3.3 are repeated in 4.3.4. Is this correct?
- "The load on the clients of the Fiware Quantum Leap API is 185 times higher ..." Is this correct?
- "... is fully cacheable" Again without further technical details, I have no idea why this is the case.
- "The bandwidth of the LDF endpoints is slightly higher" But it's sometimes more than an order of magnitude higher? Is this a fair characterisation?
- "Next the Open World Assumption (OWA) ..." This sentence is a bit strange somehow. Citing OWA here seems a bit specific as you are not applying any reasoning or formal interpretations over the data.

Review #3
By Jean Paul Calbimonte submitted on 29/May/2020
Major Revision
Review Comment:

This paper analyzes the feasibility of deploying and maintaining an open data platform for urban air quality monitoring, considering aspects related to sustainability and long term support. Although in the introduction the paper mentions the issues of sustainability and cost-effective solutions for air quality management, then the paper only really discusses in detail a single aspect, which is related to caching strategies at the Web API level, through the use of the LD fragments approach. This issue is only one small part of the entire problem of making available air quality open data in a sustainable manner.
Although I agree that there is interest in exploring this very specific topic, the authors may need to set the scope of the paper early on in order to avoid deceiving the reader. In fact, the authors state that their main research question is to study how "city governments can develop a sustainable method for publishing and archiving open data" for air quality sensors. This research question is far too broad in comparison with the actual contribution of the paper, which is more about benchmarking and comparing the LDF approach with alternatives, focusing on caching strategies of LOD.

The authors may consider reviewing several relevant works in the state of the art [7][8][9][10], which have addressed challenges related to those described in this paper, in different ways, with many take-away lessons learned that could be very relevant for this work.
For example, the authors mention the importance of making available Air quality information as Open Data. However, I am not sure this is the most cost-effective and useful strategy for public administrations. For example, in previous Air quality data experiences [2] live sensor data was used to generate high resolution air quality maps. In fact, for most of the public industries, administration and citizens, making these air quality models available as Open Data is far more useful than accessing the raw observations from the sensors. There are many reasons leading to this result, including the fact that mobile 'cheaper' sensors may need autocalibration (e.g. by adjusting measurements through proximity approaches [3]), or simply because raw data are too messy and huge in terms of volume.
In this context, the queries proposed in this paper are way too generic and make sense if we simply want to compare very general query response behaviors. However, they do not reflect very well the needs of real problems in air quality monitoring. For example, asking for the 'latest 100 observations' is a generic query with little meaning. Is it about 100 observations from any sensor at any location? What is the interest of such a query? Air quality data is generally needed within a given context. For example, to evaluate the air quality levels of a certain street, the query should be restricted to a spatial boundary defined by street segments, usually also within certain time boundaries. The time interval query scenarios described in the paper are in that sense more interesting, although a more realistic spatial boundary is still missing. In all cases, the scenarios presented do not take into account that raw observations per se are generally not that useful, as they generally need to go through processing layers before being usable by third parties [4].
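As a minimal sketch of what such a contextual query could look like, the following selects observations within a spatial bounding box and a half-open time interval. The `Observation` fields, the bounding-box approximation of a street segment, and all coordinates and values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    sensor_id: str
    timestamp: datetime
    lat: float
    lon: float
    value: float


def select(observations, bbox, start, end):
    """Return observations inside bbox with start <= timestamp < end.

    bbox is (min_lat, min_lon, max_lat, max_lon).
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    return [
        o for o in observations
        if min_lat <= o.lat <= max_lat
        and min_lon <= o.lon <= max_lon
        and start <= o.timestamp < end
    ]


# Hypothetical observations, for illustration only.
observations = [
    Observation("s1", datetime(2020, 5, 1, 8, 15), 51.050, 3.720, 21.0),
    Observation("s2", datetime(2020, 5, 1, 8, 30), 51.200, 3.720, 35.0),  # outside the box
    Observation("s1", datetime(2020, 5, 1, 9, 5), 51.050, 3.720, 19.0),   # outside the interval
]
street_bbox = (51.04, 3.71, 51.06, 3.73)
selected = select(
    observations,
    street_bbox,
    start=datetime(2020, 5, 1, 8, 0),
    end=datetime(2020, 5, 1, 9, 0),
)
print([o.sensor_id for o in selected])
```

A real street-level query would more likely use the street segment geometry itself rather than a rectangular box; the box is used here only to keep the sketch short.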

Live sensor data can be more useful in those cases where the prediction models find their limits. This may be the case in very exceptional scenarios where the general trends are broken, e.g. large road accidents. It can also be the case that anomalies can be detected (e.g. sensors failing or reporting wrong values for some reason). The authors could try to come up with more realistic scenarios where continuous queries such as these are put into place. Alerts and anomaly detection are potentially interesting scenarios.

The authors also mention several times the need for cost effective solutions. Although this is maybe out of the scope of this work, it is worth mentioning that modelling techniques can also deliver cheaper and very accurate air quality models for urban environments. Making these models available as Open Data can be more effective, accurate and cheaper than the live sensor data approach. [5] [6].

The question of cost efficiency and sustainability in Air Quality Open Data is much larger than the study that is performed in this paper. Only considering the aspects related to sensor deployment, data transmission to intermediary stations, calibration, sensor density compensation, or data quality, we have a much more complex problem that has much more serious consequences than the choice of Fiware vs. LDF.

Overall the paper essentially compares an LDF approach with a Fiware API in terms of latency and memory consumption in a set of predefined basic queries. The study is interesting on its own and I see the value of performing this evaluation. However, I think it would be important to set the scope of this paper from the beginning, and very precisely indicate that only a particular aspect related to the efficiency of query APIs will be studied. This translates among other things into a better formulated research question, and better scoping of the motivation.

Minor details

The title is, to a certain degree, misleading. It makes the reader think this paper is about a proposed sustainable platform, but instead it is about benchmarking existing infrastructures.

To avoid […] cities must become "Smart" -> this statement is not really precise. The concept of smart cities can contribute to avoiding crises but will not per se stop them from happening; it is part of a larger endeavor.

[…] which route to take in order to avoid […] -> authors may mention this work [1]

IOP or IoP? -> maybe this should be standardized throughout the paper.

Below, the competing vocabularies -> the table is not below...

[1] David Hasenfratz, Tabita Arn, Ivo de Concini, Olga Saukh, and Lothar Thiele. 2015. Health-optimal routing in urban areas. In Proceedings of the 14th International Conference on Information Processing in Sensor Networks (IPSN ’15). Association for Computing Machinery, New York, NY, USA, 398–399.

[2] Mueller, M. D., Hasenfratz, D., Saukh, O., Fierz, M., & Hueglin, C. (2016). Statistical modelling of particle number concentration in Zurich at high spatio-temporal resolution utilizing data from a mobile sensor network. Atmospheric Environment, 126, 171-181.

[3] Olga Saukh, David Hasenfratz, and Lothar Thiele. 2015. Reducing multi-hop calibration errors in large-scale mobile sensor networks. In Proceedings of the 14th International Conference on Information Processing in Sensor Networks (IPSN ’15). Association for Computing Machinery, New York, NY, USA, 274–285.

[4] Hasenfratz, D., Saukh, O., Walser, C., Hueglin, C., Fierz, M., Arn, T., ... & Thiele, L. (2015). Deriving high-resolution urban air pollution maps using mobile sensor nodes. Pervasive and Mobile Computing, 16, 268-285.

[5] Berchet, A., Zink, K., Muller, C., Oettl, D., Brunner, J., Emmenegger, L., & Brunner, D. (2017). A cost-effective method for simulating city-wide air flow and pollutant dispersion at building resolving scale. Atmospheric Environment, 158, 181-196.

[6] Berchet, A., Zink, K., Oettl, D., Brunner, J., Emmenegger, L., & Brunner, D. (2017). Evaluation of high-resolution GRAMM–GRAL (v15.12/v14.8) NOx simulations over the city of Zürich, Switzerland. Geoscientific Model Development, 10(9), 3441.

[7] Calbimonte, J. P., Eberle, J., & Aberer, K. (2017). Toward self-monitoring smart cities: the opensense2 approach. Informatik-Spektrum, 40(1), 75-87.

[8] Hu, K., Rahman, A., Bhrugubanda, H., & Sivaraman, V. (2017). HazeEst: Machine learning based metropolitan air pollution estimation from fixed and mobile sensors. IEEE Sensors Journal, 17(11), 3517-3525.

[9] Aberer, K., Sathe, S., Chakraborty, D., Martinoli, A., Barrenetxea, G., Faltings, B., & Thiele, L. (2010, November). OpenSense: open community driven sensing of environment. In Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming (pp. 39-42).

[10] Devarakonda, S., Sevusu, P., Liu, H., Liu, R., Iftode, L., & Nath, B. (2013, August). Real-time air quality monitoring through mobile sensing in metropolitan areas. In Proceedings of the 2nd ACM SIGKDD international workshop on urban computing (pp. 1-8).