
Continuously Self-Updating Query Results over Dynamic Heterogeneous Linked Data

Abstract

Our society is evolving towards massive data consumption from heterogeneous sources, which includes rapidly changing data such as public transit delay information. Many applications that depend on dynamic data consumption require highly available server interfaces. Existing interfaces involve substantial costs to publish rapidly changing data with high availability, and are therefore only feasible for organisations that can afford such an expensive infrastructure. In my doctoral research, I investigate how to publish and consume real-time and historical Linked Data on a large scale. To reduce server-side costs and make dynamic data publication affordable, I will examine different possibilities to divide query evaluation between servers and clients. This paper discusses the methods I aim to follow, together with preliminary results and the steps required to use this solution. An initial prototype achieves significantly lower server processing cost per query, while maintaining reasonable query execution times and client costs. Given these promising results, I am confident that this research direction is a feasible solution for offering low-cost dynamic Linked Data interfaces, as opposed to the existing high-cost solutions.

Keywords

  • Linked Data
  • Triple Pattern Fragments
  • SPARQL
  • Continuous querying
  • Real-time querying

R. Taelman—Supervised by Ruben Verborgh and Erik Mannens.

Introduction

The Web is an important driver of the increase in data. This data is partially made up of dynamic data, which does not remain the same over time, for example the delay of a certain train or the currently playing track on a radio station. Dynamic data is mostly published as data streams [3], which tend to be offered in a push-based way. This requires data providers to maintain a persistent connection with all clients who consume these streams. On top of that, queries over real-time data are expected to be continuous, because the data is now a continuously updating stream instead of just a finite stored dataset. At the same time, this dynamic data also leads to the generation of historical data, which may be useful for data analysis.

In this work, I investigate how to publish and consume non-high frequency real-time and historical Linked Data. This real-time data includes, for example, sensor results which update at a frequency in the order of seconds; use cases that require updates in the order of milliseconds are excluded. The focus lies on low-cost publication, so that large-scale consumption of this data becomes possible without endpoint availability problems.

In the next section, the existing work in this area will be discussed. Section 3 will explain the problem I am trying to solve, after which Sect. 4 will briefly explain the methodology for solving this problem. Section 5 will discuss the evaluation of this solution, after which Sect. 6 will present some preliminary results. Finally, in Sect. 7 I will explain the desired impact of this research.

State of the Art

Current solutions for querying and publishing dynamic data can be divided into the two mostly disjunct domains of stream reasoning and versioning, which will be explained hereafter. After that, a low-cost server interface for static data will be discussed.

Stream reasoning is defined as "the logical reasoning in real time on gigantic and inevitably noisy data streams in order to support the decision process of extremely large numbers of concurrent users" [4]. This area of research integrates data streams with traditional RDF reasoners. Existing SPARQL extensions for stream processing, like C-SPARQL [5] and CQELS [10], are based on query registration [4, 7], which allows clients to register their query at a streaming-enabled SPARQL endpoint that will continuously evaluate this query. These data streams consist of triples that are annotated with a timestamp, which indicates the moment at which the triple is valid. These querying techniques can, for example, be used to query semantic sensor data [13]. C-SPARQL is a first approach to querying over both static and dynamic data. This solution requires the client to register a query in an extended SPARQL syntax which allows the use of windows over dynamic data. The execution of queries is based on the combination of a traditional SPARQL engine with a Data Stream Management System (DSMS) [2]. The internal model of C-SPARQL creates queries that distribute work between the DSMS and the SPARQL engine to respectively process the dynamic and static data. CQELS is a "white box" approach, as opposed to "black box" approaches like C-SPARQL. This means that CQELS natively implements all query operators, as opposed to C-SPARQL, which has to transform the query to another language for delegation to its subsystems. This native implementation removes the overhead that black box approaches like C-SPARQL have. The syntax is very similar to that of C-SPARQL, also supporting query registration and time windows. According to previous research [10], this approach performs much better than C-SPARQL for large datasets; for simple queries and small datasets, the opposite is true.
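To make query registration concrete, the sketch below shows what a registered continuous query with a time window looks like in C-SPARQL-style syntax; the stream URI, vocabulary and delay threshold are hypothetical values chosen for illustration.

```python
# Illustrative C-SPARQL-style continuous query (hypothetical stream and vocabulary).
# REGISTER QUERY installs the query at the streaming endpoint, which then
# re-evaluates it over a sliding window of the last 60 seconds, advancing
# every 10 seconds, and pushes updated results to the client.
CSPARQL_QUERY = """
REGISTER QUERY TrainDelays AS
PREFIX ex: <http://example.org/transit#>
SELECT ?train ?delay
FROM STREAM <http://example.org/streams/delays> [RANGE 60s STEP 10s]
WHERE {
  ?train ex:delay ?delay .
  FILTER (?delay > 300)
}
"""
```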

Offering historical data can be achieved by versioning entire datasets [15] using the Memento protocol [14], which extends HTTP with content negotiation in the datetime dimension. Memento adds a new link to a resource in the HTTP header, named the TimeGate, which acts as the datetime dimension for a resource. It provides a list of temporal versions of the resource which can be requested. Using Memento's datetime content negotiation and TimeGates, it is possible to do Time Travel over the Web and browse pages at a specific point in time. R&WBase [17] is a triple-store versioning approach based on delta storage combined with traditional snapshots. It offers a method for querying these versioned datasets using SPARQL. The dataset can be retrieved as a virtual graph for each delta revision, thus providing Memento-like time travel without an explicit time indication. TailR [11] provides a platform through which datasets can be versioned based on a combination of snapshot and delta storage and offered using the Memento protocol. It allows queries to retrieve the dataset version at a given time and the times at which a dataset has changed.
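A minimal sketch of Memento's datetime negotiation follows; the Accept-Datetime and Memento-Datetime headers are defined by the protocol, while the TimeGate URL here is a hypothetical placeholder.

```python
import requests

# Ask a TimeGate (hypothetical URL) for a resource as it existed at a
# given moment, using Memento's Accept-Datetime request header.
response = requests.get(
    "http://example.org/timegate/http://example.org/dataset",
    headers={"Accept-Datetime": "Tue, 01 Mar 2016 12:00:00 GMT"},
)

# The TimeGate redirects to the Memento (archived version) closest to the
# requested datetime; its actual datetime is reported in Memento-Datetime.
print(response.url, response.headers.get("Memento-Datetime"))
```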

Triple Pattern Fragments (TPF) [18] is a Linked Data publication interface which aims to solve the issue of low availability and performance of existing SPARQL endpoints for static querying. It does this by moving part of the query processing to the client, which reduces the server load at the cost of increased data transfer and potentially increased query evaluation time. The endpoints are limited to an interface through which only separate triple patterns can be queried instead of full SPARQL queries. The client is then responsible for carrying out the remaining work.
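For illustration, a single TPF request for one triple pattern can be sketched as follows; the DBpedia endpoint URL reflects the public TPF deployment at the time of writing and may have changed, so treat the exact URL and response handling as assumptions.

```python
import requests

# Request one page of matches for the triple pattern { ?s rdf:type dbo:City }
# from a TPF interface; unbound positions are simply left out.
response = requests.get(
    "http://fragments.dbpedia.org/2015/en",
    params={
        "predicate": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "object": "http://dbpedia.org/ontology/City",
    },
    headers={"Accept": "text/turtle"},
)

# The fragment contains the matching triples plus metadata (an estimated
# total count) and hypermedia controls for paging; a client combines many
# such simple requests to evaluate a full SPARQL query locally.
print(response.text[:500])
```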

Problem Statement

Traditional public static SPARQL query endpoints have a major availability issue. Experiments have shown that more than half of them only reach an availability of less than \(95\,\%\) [6]. The unrestricted complexity of SPARQL queries [12], combined with the public character of SPARQL endpoints, entails an enormous server cost, which can lead to a low server availability. Dynamic SPARQL streaming solutions like C-SPARQL and CQELS offer combined access to dynamic data streams and static background data through continuously executing queries. Because of this continuous querying, the cost of these servers can become even bigger than with static querying for similarly sized datasets.

The definition of stream reasoning [4] states that it requires reasoning on data streams for "an extremely large number of concurrent users". If we cannot even reach a large number of concurrent static SPARQL queries against endpoints without overloading them, how can we expect to do this for dynamic SPARQL queries? Evaluating these queries puts an even greater load on the server, if we assume that the continuous execution of a query requires more processing than the equivalent single execution of that query.

The main research question of our work is:

Question 1: How can we combine the low-cost publication of non-high frequency real-time and historical data, such that it can efficiently be queried together with static data?

To answer this question, we also need to find an answer to the following questions:

Question 2: How can we efficiently store non-high frequency real-time and historical data and allow efficient transfer to clients?

Question 3: What kind of server interface do we need to enable client-side query evaluation over both static and dynamic data?

These research questions have led to the following hypotheses:

Hypothesis 1: Our storage solution can store new data in linear time with respect to the amount of new data.

Hypothesis 2: Our storage solution can retrieve data by time or triple values in linear time with respect to the amount of retrieved data.

Hypothesis 3: The server cost of our solution is lower than that of the alternatives.

Hypothesis 4: Data transfer is the main factor influencing query execution time, in relation to other factors like client processing and server processing.

Research Approach

As discussed in Sect. 2, TPF is a Linked Data publication interface which aims to solve the high server cost of static Linked Data querying. This is done by partially evaluating queries client-side, which requires the client to break down queries into simpler queries that can be solved by the limited and low-cost TPF server interface. These simple query results are then locally combined by the client to produce results for the original query.

We will extend this approach to continuously updating queries over dynamic data.

Fig. 1. The LDF axis showing the server effort needed to publish different types of interfaces, together with a vertical axis showing the degree of data dynamicity an interface exposes.

Figure 1 shows this shift to more static data in relation to the Linked Data Fragments (LDF) [19] axis. LDF is a conceptual framework to compare Linked Data publication interfaces, in which TPF can be seen as a trade-off between high server and high client effort for data retrieval. SPARQL streaming solutions like C-SPARQL and CQELS can handle high frequency data, and they require a high server effort because they are at least as expressive as regular SPARQL. Data streams, on the other hand, expose high frequency data as well, but here it is the client that has to do most of the work when selecting data from those streams. Our TPF query streaming extension focuses on non-high frequency data and aims to lower the server effort for more efficient scaling to large numbers of concurrent query executions.

We can divide this research into three parts, which are shown in Fig. 2. First, the server needs to be able to efficiently store dynamic data and publicly offer it. Second, this data must be transferable to the client. Third, a query engine at the client must be able to evaluate queries using this data and keep its answers up to date.

Fig. 2. A client must be able to evaluate queries by retrieving data from multiple heterogeneous data sources.

The storage of historical and real-time data requires a delicate balance between storage size and lookup speed. I will develop a method for this storage and lookup with a focus on the efficient retrieval and storage of versions, the dynamic properties of temporal data, and the scalability for historical data. This storage method can be based on the differential storage concept TailR uses, combined with HDT [8] compression for snapshots. The interface through which data will be retrieved could benefit from a variant of Memento's TimeGate index to allow users to evaluate historical queries.
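A minimal sketch of such a snapshot-plus-delta store, under the assumption that triples are plain tuples; HDT compression and the TimeGate-style index are omitted, and all names are illustrative rather than the paper's actual design.

```python
class VersionStore:
    """Dataset storage as one full snapshot plus per-version deltas (sketch)."""

    def __init__(self, snapshot):
        self.snapshot = set(snapshot)  # fully materialized initial version
        self.deltas = []               # list of (timestamp, additions, deletions)

    def commit(self, timestamp, additions, deletions):
        # Appending a delta takes O(|additions| + |deletions|) time:
        # linear in the amount of new data (cf. Hypothesis 1).
        self.deltas.append((timestamp, set(additions), set(deletions)))

    def version_at(self, timestamp):
        # Materialize the version valid at `timestamp` by replaying deltas
        # in order: linear in the amount of stored data (cf. Hypothesis 2).
        triples = set(self.snapshot)
        for ts, additions, deletions in self.deltas:
            if ts > timestamp:
                break
            triples |= additions
            triples -= deletions
        return triples
```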

To enable the client to evaluate queries, the client needs to be able to access data from one or more data providers. I will develop a mechanism that enables efficient transmission of temporal data between server and client. By exploiting the similarities between and within temporal versions, I will limit the required bandwidth as much as possible.
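For instance, a server that knows which version a client already holds only has to transfer the difference between the versions; a trivial illustration of this idea:

```python
def version_delta(old_version, new_version):
    """Compute what to send a client that already holds `old_version`."""
    additions = new_version - old_version
    deletions = old_version - new_version
    return additions, deletions

# The client then patches its local copy instead of re-downloading everything:
# local_version = (local_version - deletions) | additions
```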

To reduce the server cost, the client needs to help evaluate the query. Because of this, we assume that our solution will have a higher client processing cost than streaming-based SPARQL approaches for equivalent queries. For this last goal, I will develop a client-side query engine that is able to do federated querying of temporal data combined with static data against heterogeneous data sources. The engine must keep the real-time results of the query up to date. We can distinguish three requirements for this solution:

  • Allowing queries to be declared using a variant of the SPARQL language, so that it becomes possible for clients to declare queries over dynamic data. This language should support the RDF stream query semantics that are being discussed within the RSP Community Group (Footnote 1). We could either use the C-SPARQL or CQELS query language, or make a variant if required.

  • Building a client-side query engine that does continuous query evaluation, meaning that the query results are updated when data changes occur (a minimal sketch follows this list).

  • Providing a format for the delivery of continuously updating results for registered queries, which will allow applications to handle this data in a dynamic context.
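The following sketch of such a continuous-evaluation loop is illustrative only: the `evaluate_sparql` helper and the idea that each evaluation reports when its results expire are assumptions about one possible design, not the paper's final engine.

```python
import time

def run_continuously(query, evaluate_sparql, on_results):
    """Keep the results of `query` up to date by timed re-evaluation (sketch).

    `evaluate_sparql` is a hypothetical client-side evaluator returning the
    current results together with the earliest expiration timestamp of the
    dynamic data those results depend on.
    """
    while True:
        results, expires_at = evaluate_sparql(query)
        on_results(results)  # deliver the updated results to the application
        # Sleep until the earliest result expiration, then re-evaluate.
        time.sleep(max(0.0, expires_at - time.time()))
```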

Evaluation Plan

I will evaluate each of the three major elements of this research independently: the storage solution for dynamic data at the server; the retrieval of this data and its transmission; and the query evaluation by the client.

Storage

The evaluation of our storage solution can be done with the help of two experiments.

First, I will execute a large number of insertions of dynamic data against a server. I will measure its CPU usage and determine if it is still able to attain a decent quality of service for data retrieval. I will also measure the increase in data storage. By analyzing the variance of the CPU usage under different insertion patterns, we should be able to accept or reject Hypothesis 1, which states that data can be added in linear time.

The second experiment will consist of the evaluation of data retrieval. This experiment will consist of a large number of lookups against a server by both triple contents and time instants. Doing a variance analysis on the lookup times over these different lookup types will help us determine the validity of Hypothesis 2, which states that data can be retrieved in linear time.

These two experiments can be combined to see if one or the other demands too much of the server's processing power.
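As a sketch of the variance analyses mentioned above, a one-way ANOVA over lookup times grouped by lookup type could look as follows; the sample values are placeholders, not measured results.

```python
from scipy import stats

# Placeholder lookup-time samples (seconds) per lookup type.
lookups_by_triple = [0.012, 0.014, 0.011, 0.013, 0.012]
lookups_by_time = [0.021, 0.019, 0.022, 0.020, 0.023]

# One-way ANOVA: does the lookup type significantly affect lookup time?
f_statistic, p_value = stats.f_oneway(lookups_by_triple, lookups_by_time)
print(f"F = {f_statistic:.2f}, p = {p_value:.4f}")
```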

Retrieval and Transmission

To determine the cost of retrieving data from a server and transmitting it, we need to measure the effects of sending a large number of lookup requests. One of the experiments I performed on the solution that was built during my master's thesis [16] was made up of one server and ten physical clients. Each of these clients could execute from one to twenty concurrent unique queries. This results in a series of 10 to 200 concurrent query executions. This setup was used to test the client and server performance of my implementation compared to C-SPARQL and CQELS.

Even though this experiment produced some interesting results, as will be explained in the next section, 200 concurrent clients are not very representative of large-scale querying on the public Web. But it can already be used to partially answer Hypothesis 3, which states that our solution has a lower server cost than the alternatives. To extend this, I will develop a mechanism to simulate thousands of simultaneous requests to a server that offers dynamic data. The main bottleneck in the current experiment setup is the query engines on each client. If we were to decouple the query engines from the experiment, we could send many more requests to the server, which would result in more representative results. This could be done by first collecting a representative set of HTTP requests that these query engines send to the server. This set of requests should be based on real non-high frequency use cases where it makes sense to have a large number of concurrent query evaluations. These requests can be inspired by existing RSP benchmarks like SRBench [20] and CityBench [1]. Once this collection has been built, the client-CPU intensive task is over, and we can use this pool of requests to quickly simulate HTTP requests to our server. By doing a variance analysis of the server CPU usage for my solution compared to the alternatives, we will be able to determine the truth of Hypothesis 3.
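A minimal sketch of such a request replay, assuming the collected requests are stored as one URL per line; the file name and concurrency limit are illustrative.

```python
import asyncio
import aiohttp

async def replay(urls, concurrency=1000):
    """Replay pre-collected HTTP requests against the server under test."""
    semaphore = asyncio.Semaphore(concurrency)  # cap simultaneous requests

    async def fetch(session, url):
        async with semaphore:
            async with session.get(url) as response:
                await response.read()  # drain the body; we only load the server

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, url) for url in urls))

# requests.txt (illustrative) holds the URLs collected from real query engines.
with open("requests.txt") as f:
    asyncio.run(replay([line.strip() for line in f if line.strip()]))
```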

Query Evaluation

The evaluation of the client-side query engine can be done like the experiment of my master's thesis, as explained in the previous section. In this case, the results would be representative, since the query engine is expected to be the most resource-intensive element in this solution. The CityBench [1] RSP benchmark could, for example, be used to do measurements based on datasets from various city sensors. By doing a variance analysis of the different clients' CPU usage for my solution compared to the alternatives, we will be able to determine how much higher our client CPU cost is than that of the alternatives. The alternatives in this case include server-side RSP engines like C-SPARQL and CQELS, but also fully client-side stream processing solutions using stream publication techniques like Ztreamy [9]. This way, we compare our solution with both sides of the LDF axis: on the one hand, we have the cases where the server does all of the work while evaluating queries; on the other hand, we have the cases where the client does all of the work. For Hypothesis 4, which assumes that data transfer is the main factor in query execution time, we will do a correlation test of bandwidth usage against the respective queries' execution times.
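That correlation test could be sketched as a Pearson correlation over paired per-query measurements; the values below are placeholders.

```python
from scipy import stats

# Paired per-query measurements: bytes transferred and execution time (s).
bandwidth_bytes = [12_000, 45_000, 90_000, 150_000, 310_000]
execution_times = [0.4, 1.1, 2.0, 3.6, 7.2]

# A strong positive correlation would support Hypothesis 4.
r, p_value = stats.pearsonr(bandwidth_bytes, execution_times)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")
```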

Preliminary Results

During my master's thesis, I did some preliminary research on the topic of continuous querying over non-high frequency real-time data. My solution consisted of annotating triples with time to give them a temporal context, which allowed this dynamic data to be stored on a regular static TPF server. An extra layer on top of the TPF client was able to interpret these time-annotated triples as dynamic versions of certain facts. This extra software layer could then derive the exact moment at which the query should be re-evaluated to keep its results up to date.
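As an illustration of such time annotation, the snippet below expresses a fact together with its validity interval using a named graph, in TriG syntax; the vocabulary and the graph-based annotation style are assumptions for illustration, since several annotation methods are possible.

```python
# A time-annotated fact in TriG syntax (illustrative vocabulary): the named
# graph groups the fact with the interval during which it is valid. The
# client layer schedules the next query re-evaluation at the earliest
# expiration time among the dynamic triples the current results depend on.
TIME_ANNOTATED_FACT = """
@prefix ex:  <http://example.org/transit#> .
@prefix tmp: <http://example.org/temporal#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:graph1 {
  ex:train42 ex:delay "300"^^xsd:integer .
}

ex:graph1 tmp:initial    "2016-03-01T12:00:00Z"^^xsd:dateTime ;
          tmp:expiration "2016-03-01T12:01:00Z"^^xsd:dateTime .
"""
```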

The main experiment that was performed in my master's thesis resulted in the output of Fig. 3. We can see that our approach significantly reduced the server load when compared to C-SPARQL and CQELS, as was the main goal. The client now pays for the largest part of the query executions, which is caused by the use of TPF. The client CPU usage for our implementation spikes at the time of query initialization because of the rewriting phase, but after that it drops to around \(5\,\%\).

Fig. 3. The client and server CPU usages for one query stream for C-SPARQL, CQELS and our preliminary solution. Our solution has a very low server cost and a higher average client CPU usage when compared to the alternatives.

Conclusions

Once we can publish both non-high frequency real-time and historical data at a low server cost, we can finally allow many simultaneous clients to query this data while keeping their results up to date, so that this dynamic data can be used in our applications with the same ease as we already do today with static data.

The Semantic Sensor Web already promotes the integration of sensors into the Semantic Web. My solution would make medium to low frequency sensor data queryable at web scale, instead of only by a few machines in a private environment for keeping the server cost maintainable.

Current Big Data analysis techniques are able to process data streams, but combining them with other data by discovering semantic relations still remains difficult. The solution presented in this work could make these Big Data analyses possible using Semantic Web techniques. This would make it possible to perform these analyses in a federated manner over heterogeneous sources, since a strength of Semantic Web technologies is the ability to integrate data from the whole Web. These analyses could be executed by not just one entity, but by all clients with access to the data, while still putting a reasonable load on the server.

References

  1. Ali, M.I., Gao, F., Mileo, A.: CityBench: a configurable benchmark to evaluate RSP engines using smart city datasets. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9367, pp. 374–389. Springer, Heidelberg (2015). doi:10.1007/978-3-319-25010-6_25


  2. Arasu, A., Babcock, B., Babu, S., Cieslewicz, J., Datar, M., Ito, K., Motwani, R., Srivastava, U., Widom, J.: STREAM: the Stanford data stream management system. Book chapter (2004). http://ilpubs.stanford.edu:8090/641/1/2004-20.pdf

  3. Babu, S., Widom, J.: Continuous queries over data streams. ACM SIGMOD Rec. 30(3), 109–120 (2001). http://dl.acm.org/citation.cfm?id=603884


  4. Barbieri, D., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: Stream reasoning: where we got so far. In: Proceedings of the NeFoRS2010 Workshop, Co-located with ESWC 2010 (2010). http://wasp.cs.vu.nl/larkc/nefors10/paper/nefors10_paper_0.pdf

  5. Barbieri, D.F., Braga, D., Ceri, S., Valle, E.D., Grossniklaus, M.: Querying RDF streams with C-SPARQL. SIGMOD Rec. 39(1), 20–26 (2010)


  6. Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y.: SPARQL web-querying infrastructure: ready for action? In: Alani, H., et al. (eds.) ISWC 2013, Part II. LNCS, vol. 8219, pp. 277–293. Springer, Heidelberg (2013)


  7. Della Valle, E., Ceri, S., van Harmelen, F., Fensel, D.: It's a streaming world! Reasoning upon rapidly changing data. IEEE Intell. Syst. 24(6), 83–89 (2009). http://www.few.vu.nl/~frankh/postscript/IEEE-IS09.pdf


  8. Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A., Arias, M.: Binary RDF representation for publication and exchange (HDT). Web Semant. Sci. Serv. Agents World Wide Web 19, 22–41 (2013). http://www.sciencedirect.com/science/article/pii/S1570826813000036


  9. Fisteus, J.A., Garcia, N.F., Fernandez, L.S., Fuentes-Lorenzo, D.: Ztreamy: a middleware for publishing semantic streams on the web. Web Semant. Sci. Serv. Agents World Wide Web 25, 16–23 (2014)


  10. Le-Phuoc, D., Dao-Tran, M., Xavier Parreira, J., Hauswirth, M.: A native and adaptive approach for unified processing of linked streams and linked data. In: Aroyo, L., et al. (eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 370–388. Springer, Heidelberg (2011)


  11. Meinhardt, P., Knuth, M., Sack, H.: TailR: a platform for preserving history on the web of data. In: Proceedings of the 11th International Conference on Semantic Systems, pp. 57–64. ACM (2015). http://dl.acm.org/citation.cfm?id=2814875

  12. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 30–43. Springer, Heidelberg (2006)


  13. Sheth, A., Henson, C., Sahoo, S.: Semantic sensor Web. IEEE Internet Comput. 12(4), 78–83 (2008). http://corescholar.libraries.wright.edu/cgi/viewcontent.cgi?article=2125&context=knoesis

  14. de Sompel, H.V., Nelson, M.L., Sanderson, R., Balakireva, L., Ainsworth, S., Shankar, H.: Memento: time travel for the web. CoRR abs/0911.1112 (2009). http://arxiv.org/abs/0911.1112

  15. de Sompel, H.V., Sanderson, R., Nelson, M.L., Balakireva, L., Shankar, H., Ainsworth, S.: An HTTP-based versioning mechanism for linked data. CoRR abs/1003.3661 (2010). http://arxiv.org/abs/1003.3661

  16. Taelman, R.: Continuously updating queries over real-time linked data. Master's thesis, Ghent University, Belgium (2015). http://rubensworks.net/raw/publications/2015/continuously_updating_queries_over_real-time_linked_data.pdf

  17. Vander Sande, M., Colpaert, P., Verborgh, R., Coppens, S., Mannens, E., Van de Walle, R.: R&Wbase: git for triples. In: LDOW (2013). http://events.linkeddata.org/ldow2013/papers/ldow2013-paper-01.pdf

  18. Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., Cyganiak, R., Colpaert, P., Mannens, E., Van de Walle, R.: Querying datasets on the Web with high availability. In: Proceedings of the 13th International Semantic Web Conference (2014). http://linkeddatafragments.org/publications/iswc2014.pdf


  19. Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., Van de Walle, R.: Web-scale querying through Linked Data Fragments. In: Proceedings of the 7th Workshop on Linked Data on the Web (2014). http://events.linkeddata.org/ldow2014/papers/ldow2014_paper_04.pdf

  20. Zhang, Y., Duc, P.M., Corcho, O., Calbimonte, J.-P.: SRBench: a streaming RDF/SPARQL benchmark. In: Heflin, J., et al. (eds.) ISWC 2012, Part I. LNCS, vol. 7649, pp. 641–657. Springer, Heidelberg (2012)



© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Taelman, R. (2016). Continuously Self-Updating Query Results over Dynamic Heterogeneous Linked Data. In: Sack, H., Blomqvist, E., d'Aquin, M., Ghidini, C., Ponzetto, S., Lange, C. (eds.) The Semantic Web. Latest Advances and New Domains. ESWC 2016. Lecture Notes in Computer Science, vol. 9678. Springer, Cham. https://doi.org/10.1007/978-3-319-34129-3_55

  • DOI: https://doi.org/10.1007/978-3-319-34129-3_55
  • Print ISBN: 978-3-319-34128-6
  • Online ISBN: 978-3-319-34129-3


Source: https://link.springer.com/chapter/10.1007/978-3-319-34129-3_55