Introduction
SPARQL endpoints, that expose Linked Data on the Web through a query-based interface, tend to suffer from availability issues [1]. In comparison to most other HTTP servers, SPARQL endpoints require high-end computational resources due the high complexity of SPARQL queries and can thus be difficult to sustain when a number of concurrent clients request query execution. In order to cope with this problem, the Linked Data Fragments (LDF) effort [2] has been initiated as a conceptual framework to investigate alternative query interfaces to publish Linked Datasets, by redistributing the effort of query evaluation between servers and clients.
LDF interfaces allow some parts of the query to be performed on the server, and some on the client, which leads to a redistribution of effort between server and client. This redistribution requires queries to be decomposed into multiple smaller queries, which typically leads to slower query execution due to the HTTP overhead of these roundtrips, compared to fully server-side query execution. In order to reduce this number of smaller queries, servers could send a pre-filter to the client, which could potentially eliminate many of these queries. The focus of this work is investigating such pre-filters.
In recent years, different kinds of these LDF interfaces have been introduced, such as Triple Pattern Fragments (TPF) [2], Bindings-Restricted Triple Pattern Fragments [3], SaGe [4], and Smart-KG [5]. Each of these types of interfaces introduce their own trade-offs in terms of server and client effort. Additionally, LDF interfaces can enable feature-based extensibility, which allows servers to optionally expose certain features as metadata through usage of self-descriptive hypermedia [6], which can then be detected automatically by supporting clients to enhance the query evaluation process. Due to the extensibility of TPF, several interface features have already been proposed for TPF [7, 8, 9]. One such feature is Approximate Membership Filter (AMF) [7] metadata, which supporting clients can use to reduce the number of HTTP requests, with only a slight increase in server cost. Unfortunately, this currently comes at the cost of slower query execution, because the individual HTTP requests were larger and more expensive to compute. Since TPF is quickly gaining adoption among publishers [10], we focus on improving the performance of AMF with TPF in this work. AMF could also be useful for other types of LDF interfaces, but we consider this out of scope for this work.
Even though the work on extending TPF with AMFs showed excessive overhead, we claim that these problems can be resolved, and that AMFs can be used to lower overall query execution times without significantly increasing server load. As such, the goal of our work is to investigate what changes are required server-side and client-side to optimize AMFs for TPF. Concretely, we introduce six dimensions through which the AMF approach from Vander Sande et al. [7] can be improved. One of these dimensions involves the introduction of a new client-side algorithm to handle AMFs. The other dimensions are related to the server-side handling of AMFs. The effects and feasibility of each of these dimensions are evaluated and analyzed in detail. In summary, our work brings a deeper understanding of the appliance and benefits of AMF metadata for Linked Data interfaces, so that Linked Data publishers can expose their Linked Datasets in a more efficient manner through TPF interfaces.
Problem Statement
The goal of our work is to optimize query execution over TPF interfaces using AMFs. We build upon the work from Vander Sande et al. [7], where the authors allowed the number of HTTP requests to be reduced at the cost of slower query execution. Our goal is to mitigate this major drawback, while retaining its advantages.
Vander Sande et al. introduced a number of follow-up questions that we use as a basis for defining our research questions. Concretely, in order to mitigate the earlier mentioned drawbacks, our research questions are defined as follows:
- Can query execution time be lowered by combining triple pattern AMFs client-side on larger Basic Graph Patterns (BGPs)?
Earlier work focused on using AMF metadata from triple pattern queries to test the membership of materialized triples, while there is potential for exploiting this for other types of patterns in the query as well. For instance, combining multiple AMFs at BGP-level by applying AMFs on triple patterns with shared variables. - To what extent do HTTP caching and AMFs speed up query execution?
As Vander Sande et al. suggest that caching of AMFs reduce server delays, we investigate the impact of caching HTTP requests and/or AMFs. - How does selectively enabling AMF impact server load and querying?
Earlier work introduced AMF as a feature that was always enabled. However, some specific AMFs may be too expensive for servers to calculate on the fly. As such, it may be beneficial to only enable AMF for queries that have a result count lower than a certain threshold. - How does network bandwidth impact query performance with AMFs?
In experiments by Vander Sande et al., the HTTP bandwidth was set to a realistic 1Mbps. However, there is still an open question as to what extent different rates have an impact on the importance of AMF. - How low can AMF false-positive probabilities become to still achieve decent client-side query performance?
Based on their results, Vander Sande et al. have suggested that additional experimentation is needed with regards to lower AMF false-positive probabilities, as higher probabilities did not have a significant effect on query performance. Note that query correctness is never affected, but rather the number of requests to the server, since every positive match requires a request to verify whether it is a true or false positive.
An answer to these research questions will be formed using the experiments from Section 5.
Client-side AMF Algorithms
In this section, we explain the existing triple-based AMF algorithm, and we show where it lacks. Following that, we introduce a new client-side BGP-based AMF algorithm that solves these problems. For the reader’s convenience, detailed examples of how these two algorithms work can be found on https://github.com/comunica/Article-SSWS2020-AMF/wiki/AMF-algorithm-examples. Finally, we introduce a heuristic that determines whether or not the BGP-based algorithm is beneficial to use.
Triple-based AMF Algorithm
Vander Sande et al. [7] introduced an algorithm that acts as a cheap pre-processing step for testing the membership of triples. This algorithm was used in combination with the streaming greedy client-side TPF algorithm [2] for evaluating SPARQL queries. Listing 1 depicts this algorithm in pseudo-code.
Concretely, every triple pattern that has all of its variables resolved to constants is run through this function right before a more expensive HTTP request would be performed. This function takes a triple and a query context containing the AMFs that were detected during the last TPF response for that pattern. It will test the AMFs for all triple components, and from the moment that a true negative is found, false will be returned. Once all checks pass, the original HTTP-based membership logic will be invoked.
BGP-based AMF Algorithm
Following the idea of the triple-based algorithm, we introduce an extension that applies this concept for BGPs. This makes it possible to use AMFs not only for testing the membership of triples, but also for using AMFs to test partially bound triple patterns that may still have variables. In theory, this should filter (true negative) bindings earlier in the query evaluation process.
Listing 2 shows this algorithm in pseudo-code. Just like the triple-based algorithm, it acts as a pre-processing step when BGPs are being processed. It takes a list of triple patterns as input, and query context containing a list of corresponding AMFs that were detected during the last TPF responses for each respective pattern. The algorithm iterates over each pattern, and for each triple component that is not a variable, it will run it through its AMF. Once a true negative is found, it will immediately return an empty stream to indicate that this BGP definitely contains no results. If all checks on the other hand pass, the original BGP logic will be invoked, which will down the line invoke more expensive HTTP requests.
Heuristic for Enabling the BGP Algorithm
While our BGP-based algorithm may filter out true negative bindings sooner than the the triple-based algorithm,
it may lead to larger AMFs being downloaded, possibly incurring a larger HTTP overhead.
In some cases, this cost may become too high if the number of bindings that needs to be tested is low,
e.g. downloading an AMF of 10MB would be too costly when only a single binding needs to be tested.
To cope with these cases, we introduce a heuristic in Listing 3,
that will estimate whether or not the BGP-based algorithm will be cheaper in terms of HTTP overhead
compared to just executing the HTTP membership requests directly.
Concretely, the heuristic checks if the size of an AMF is lower than the size of downloading TPF responses.
This heuristic has been designed for fast calculation,
with exactness as a lower priority.
Based on measurements, we set AMF_TRIPLE_SIZE
to 2 bytes,
and TPF_BINDING_SIZE
to 1000 bytes by default.
In Section 5, we will evaluate the effects for different TPF_BINDING_SIZE
values.
In future work, more exact heuristics should be investigated
that take perform live profiling of HTTP requests and delays to avoid the need of these constants.
Evaluation
The goal of this section is to answer the research questions from Section 3. First, we briefly discuss the implementations of our algorithm. After that, we present our experimental setup, and we present our results. All code and results results can be found on GitHub.
Implementation
For implementing the client-side AMF algorithms, we make use of the JavaScript-based Comunica SPARQL querying framework [18]. Since Comunica already fully supports the TPF algorithm, we could implement our algorithms as fully standalone plugins. Our algorithms are implemented in separate Comunica modules, and are available open-source on GitHub. Concretely, we implemented the original triple-based AMF algorithm, our new BGP-based AMF algorithm (BGP Simple), and a variant of this BGP-based algorithm (BGP Combined) that pre-fetches AMFs in parallel.
The original TPF server extension in the LDF server software by Vander Sande et al. [7] allowed both Bloom filters and GCS to be created on the fly for any triple pattern. To support our experiments, we extended this implementation with new features. This implementation is available on GitHub. In order to measure the server overhead of large AMFs, we added a config option to dynamically enable AMFs for triple patterns with number of matching triples below a given result count threshold. Next to that, we implemented an optional file-based cache to avoid recomputing AMFs to make pre-computation of AMFs possible.
Experimental Setup
Based on our LDF server and Comunica implementations that were discussed in Subsection 5.1, we defined five experiments, corresponding to our five research questions from Section 3. These experiments are defined and executed using Comunica Bencher, which is a Docker-based benchmark execution framework for evaluating Linked Data Fragments. This enables reproducibility of these experiments, as they can be re-executed with a single command.
The following experiments execute WatDiv with a dataset scale of 100 and a query count of 5 for the default query templates, leading to a total of 100 queries. We only report results for Bloom filters for experiments where no significant difference was measured with GCS. Each experiment includes a warmup phase, and averages results over 3 separate runs. During this warmup phase, the server caches all generated AMFs. For each query, the client-side timeout was set to 5 minutes and, to enforce a realistic Web bandwidth, the network delay was set to 1024Kbps. All experiments were executed on a 64-bit Ubuntu 14.04 machine with 128 GB of memory and a 24-core 2.40 GHz CPU—each Docker container was limited to one CPU core, behind an NGINX HTTP cache.
- Client-side AMF Algorithms: In this experiment, we compare different client-side algorithms (None, Triple, BGP Simple, BGP Combined, Triple with BGP Combined) for using AMF metadata.
- Caching: In this experiment, we evaluate the effects of caching all HTTP requests combined with caching AMF filters server-side, both following the LRU cache replacement strategy. We also compare the effects of using AMF metadata client-side or not. Finally, we test the effects of warm and cold caches.
- Dynamically Enabling AMF: In this experiment, we compare different result count thresholds (0, 1.000, 10.000, 100.000, 1.000.000) with each other, with either the server-side AMF filter cache enabled or not. We disable the HTTP cache and warmup phase to evaluate a cold-start.
- Network Bandwidths: Different network bandwidths (256kbps, 512kbps, 2048kbps, 4096kbps) are tested for evaluating network speedups, and their effects or different AMF algorithms (None, Triple, BGP Combined) are tested.
- False-positive Probabilities: In this final experiment, we compare different AMF false-positive probabilities (1/4096, 1/1024, 1/64, 1/4, 1/2).
Results
In this section, we present the results for each of our experiments separately. We analyzed our results statistically by comparing means using the Kruskal-Wallis test, and report on their p-values (low values indicate non-equal means).
Client-side AMF Algorithms
Approach | Requests | Relative requests | Cache hits | Cache hit rate |
---|---|---|---|---|
None | 1,911,845 | 100.00% | 1,686,889 | 88.23% |
Triple | 1,837,886 | 96.13% | 1,626,611 | 88.50% |
BGPSimple | 191,764 | 10.03% | 173,617 | 90.53% |
BGPCombined | 191,768 | 10.03% | 173,621 | 90.53% |
TripleBGP | 191,773 | 10.03% | 173,626 | 90.53% |
Fig. 3: Number of HTTP requests, number of HTTP requests relative to not using AMFs, number of cache hits and cache hit rate for the different client-side algorithms. BGP-based algorithms require significantly fewer HTTP requests.
Fig. 1 shows the query evaluation times for our first experiment on the different client-side algorithms for using AMF metadata. In line with what was shown in the first TPF AMF experiments [7], the triple-based algorithm reduces query evaluation times in only 2 of the 20 queries. Our new BGP-based algorithms on the other hand reduce query evaluation times and outperforms the triple-based algorithm. Only for 5 of the 20 queries, evaluation times are higher or equal. Our combined BGP algorithm is slightly faster than the simple BGP algorithm. By using both the combined BGP-based and the triple-based algorithms, we can reduce evaluation times slightly further.
Fig. 2 shows the query result arrival times for query F3, and is similar to the arrival times for other queries. This figure shows that the time-until-first-result is the highest for BGP-based AMF algorithms. However, once this first result comes in, the arrival rate becomes much higher compared to the other algorithms. This delay for the BGP-based algorithms is caused by the higher download times for large AMFs, and explains the higher or equal evaluation times for 5 of the 20 queries.
Fig. 3 shows the BGP-based algorithms significantly lower the number of required HTTP requests, which explains the significant reduction in query execution times. This allows the NGINX cache hit rate to slightly increase compared to the regular and triple-based TPF algorithms, since fewer requests are made, which lowers the number of required cache evictions.
Based on these results, there is no statistically significant difference between the evaluation times of the triple-based AMF algorithm, and not using AMF metadata at all (p-value: 0.93). The simple and combined BGP algorithms are significantly faster than not using AMF metadata (p-values < 0.01). Furthermore, the simple and combined BGP algorithm are on average more than twice as fast as the triple-based algorithm, which make them significantly faster (p-values < 0.01). Furthermore, combining our simple and combined BGP algorithm with the triple-based algorithms shows no further statistically significant improvement (p-values: 0.95, 0.67).
In Fig. 4, we show the results where we apply the heuristic for dynamically disabling the BGP heuristic based on different parameter values. On average, setting the request size parameter value to 2000 has the lowest average evaluation time for this experiment. This case achieves lower evaluation times for 19 of the 20 queries, which is an improvement compared to not using the heuristic. This improvement by itself however only small, and not statistically significant (p-value: 0.18).
Caching
Fig. 5 shows that caching either HTTP requests or AMF filters server-side has a significant positive effect on query evaluation times (p-value: < 0.01). We observe that caching HTTP requests reduces query evaluation times more than just caching AMF filters (p-value: 0.02). Furthermore, there is no significant difference between query evaluation times for caching of both HTTP requests and AMF filters compared to just caching HTTP requests (p-value: 0.77). This shows that an HTTP cache achieves the best results, and additionally caching AMF filters server-side is not worth the effort.
If we compare these results with the results for non-AMF-aware querying, we see that if HTTP caching is disabled, query evaluation times for non-AMF-aware querying are significantly lower than AMF-aware approaches (p-value: < 0.01). On the other hand, if HTTP caching is enabled, query evaluation times for non-AMF-aware querying are significantly worse than with AMF-aware approaches (p-value: < 0.01). While caching is already very important for TPF-based querying, these results show that caching becomes even more important when AMFs are being used.
Finally, our results show that when our cache is warm, exposing Bloom filters instead of GCS achieves faster query evaluation times. While there are a few outliers where GCS is two to three times slower, the difference is only small in most cases (p-value: 0.18).
Dynamically Enabling AMF
Fig. 6 shows lower server-side AMF result count thresholds lead to higher query evaluation times when HTTP caching is enabled (p-value: < 0.01). Fig. 7 shows that AMF result count thresholds also have an impact on query evaluation times when HTTP caching is disabled (p-value: < 0.01), but it does not necessarily lower it. For this experiment, setting the threshold to 10K leads to the lowest overall query evaluation times.
Fig. 8 shows that lower AMF result count thresholds lead to lower server loads when HTTP caching is disabled (p-value: 0.03). On the other hand, if HTTP caching is enabled, there is no correlation (Pearson) between AMF result count threshold and server CPU usage (p-value: 0.46). This shows that if caching is enabled, dynamically enabling AMFs based on the number of triples is not significantly important, and may therefore be disabled to always expose AMFs.
For this experiment, average CPU usage increased from 31.65% (no AMF) to 40.56% (all AMF) when caching is enabled. Furthermore, when looking at the raw HTTP logs, we observe that by always exposing AMFs, we use 28.66% of the total number of HTTP requests compared to not exposing AMFs. As such, AMFs significantly reduce the number of HTTP requests at the cost of ~10% more server load.
Network Bandwidth
Fig. 9, Fig. 10 and Fig. 11 show the effects of different bandwidths on query evaluation times over different algorithms. We observe that when not using AMF, or using the triple-level AMF algorithm, lower bandwidths lead to higher query evaluation times. However, when bandwidths become much higher, query evaluation times decrease at a lower rate. In contrast, the BGP-level AMF algorithm continuously becomes faster when bandwidth increases. We do not measure any significant impact of bandwidth on both non-AMF usage and triple-level AMF usage (p-values: 0.29, 0.23). For BGP-level AMF, we measure a significant impact (p-value: < 0.01). This shows that if BGP-level AMF is used, then higher bandwidths can be exploited more for faster query evaluation.
False-positive Probabilities
Fig. 13 shows that different false-positive probabilities have impact on query evaluation times. This impact has however only a weak significance (p-value: 0.18). On average, a false-positive probability of 1/64 leads to the lowest overall query evaluation times for this experiment.
Conclusions
In this article, we introduced client-side and server-side improvements to the AMF feature for TPF. The experimental results show that our client-side algorithms make average query execution more than two times faster than with regular TPF while only requiring 10% of the number of HTTP requests, at the cost of less than 10% additional server CPU usage.
We offer implementations of these algorithms and server enhancements, which means that they can be used by any of the existing data publishers that are exposing their data through a TPF interface, or any client that aims to query from them.
Hereafter, we conclude our findings with respect to our research questions, based on the evaluation, and we introduce a set of recommendations for data publishers using AMF with TPF.
Research findings
BGP-based Algorithms Improve Query Efficiency
Results show that our new client-side BGP-based algorithms that use AMF metadata significantly reduce query evaluation times (Research Question 1). However, the are a few outliers where our new algorithms perform worse than the triple-based algorithm. This is because AMFs are sometimes very large, which has a significant impact on query execution times when they have to be downloaded from the server. Our results have shown that a heuristic that can decide whether or not to use the BGP-based algorithm can solve this problem, but further research is needed to come up with a more general heuristic that works in a variety of cases.
BGP-based Algorithms Postpone Time to First Results
Even though total query evaluation times for the AMF-aware algorithms are mostly lower, we observe that the time-until-first-result is mostly higher. The reason for this is that the BGP-based algorithms tends to use larger AMFs, which introduces a bottleneck when requesting them over HTTP. Even though we have this overhead, the gains we get from this are typically worth it, as results come in much faster once the AMFs have been downloaded. This finding shows that dynamically switching between different algorithms may be interesting to investigate in future work. Our bandwidth experiment results confirm this bandwidth bottleneck when downloading large AMFs, and show that higher bandwidths lead to even more performance gains for the BGP-level algorithms (Research Question 4). This continuous efficiency can be investigated further in the future using metrics such as diefficiency [19].
Pre-computation and Caching of AMFs is Essential
Our results show that AMF-aware querying only has a positive impact on query evaluation times if the server can deliver AMF filters sufficiently fast (Research Question 2) by for example caching them. Furthermore, if no cache is active, AMF-aware querying performs worse than non-AMF-aware querying. Ideally, all AMFs should be pre-computed, but due to the large number of possible triple patterns in a dataset, this is not feasible. On the other hand, our results have shown that server-side on the fly creation of AMFs only starts to have a significant impact for sizes larger than 10.000 (Research Question 3).
On a low-end machine (2,7 GHz Intel Core i5, 8GB RAM), creation of AMFs takes 0.0125 msec per triple, or 0.125 seconds for AMF creation of size 10.000. As such, AMFs of size 10.000 or less can be created on the fly with acceptable durations for Web servers (after which they can still be cached).
Fig. 14 shows that there is only a very small number of triple patterns with a very large number of matches. When setting the WatDiv dataset to a size of 10M triples, there are only 90 triple patterns with a size larger than 10.000. Setting that size to 100M triples, this number increases to 255, so this is not a linear increase. Due to this low number of very large patterns, servers can easily pre-compute these offline before dataset publication time. Since the WatDiv dataset achieves a high diversity of structuredness, it is similar to real-world RDF datasets [20]. As such, this behavior can be generalized to other datasets with a similar structuredness.
Bloom Filters are Preferred over GCS with Active Cache
Results show that when AMFs are pre-computed, Bloom filters achieve faster query evaluation times than GCS (Research Question 2). This is because Bloom filter creation requires less effort client-side than GCS due to the simpler decompression, at the cost of more server effort. However, this higher server effort is negligible if AMFs can be pre-computed. As such, we recommend Bloom filters to always be preferred over GCS, unless AMFs can not be cached.
A Good Trade-off Between False-positive Probabilities and AMF Size
Lowering the false-positive probability of an AMF increases its size. As we have seen that larger AMFs have an impact on query evaluation times, we do not want AMFs to become too large. On the other hand, we do not want the false-positive probabilities to become too high, as that leads to more unneeded HTTP requests. Our results have shown that a probability of 1/64 leads to an optimal balance for our experiments (Research Question 5). However, further research is needed to investigate this trade-off for other types of datasets and queries.
Recommendations for Publishers
Based on the conclusions of our experimental results, we derived the following guidelines for publishers who aim to use the AMF feature:
- Enable HTTP caching with a tool such as NGINX.
- Pre-compute AMFs (or at least cache) AMFs of size 10.000 or higher.
- If AMFs can be cached, prefer Bloom filters over GCS.
- Use a false-positive probability of 1/64.