Comunica: an overview

Abstract

Query evaluation over Linked Data sources has become a complex story, given the multitude of algorithms and techniques for single- and multi-source querying, as well as the heterogeneity of Web interfaces through which data is published online. Today’s query processors are insufficiently adaptable to test multiple query engine aspects in combination, such as evaluating the performance of a certain join algorithm over a federation of heterogeneous interfaces. The Semantic Web research community is in need of a flexible query engine that allows plugging in new components such as different algorithms, new or experimental SPARQL features, and support for new Web interfaces. We designed and developed a Web-friendly and modular meta query engine called Comunica that meets these specifications. Some time ago we published our first resources paper about Comunica. In this article, we expand upon the architectural decisions and introduce many of the resources available to get started developing with the framework. We show how its modular nature makes it an ideal research platform for investigating new kinds of Linked Data interfaces and querying algorithms. We additionally cover some of the instances where Comunica is already being used by other tools. Comunica facilitates the development, testing, and evaluation of new query processing capabilities, both in isolation and in combination with others.

Introduction

Linked Data on the Web exists in many shapes and forms—and so do the processors we use to query data from one or multiple sources. For instance, engines that query RDF data using the SPARQL language [1] employ different algorithms [2, 3] and support different language extensions [4, 5]. Furthermore, Linked Data is increasingly published through different Web interfaces, such as data dumps, Linked Data documents [6], SPARQL endpoints [7] and Triple Pattern Fragments (TPF) interfaces [8]. This has led to entirely different query evaluation strategies, such as server-side [7], link-traversal-based [9], shared client–server query processing [8], and client-side (by downloading data dumps and loading them locally).

The resulting variety of implementations suffers from two main problems: a lack of sustainability and a lack of comparability. Alternative query algorithms and features are typically either implemented as forks of existing software packages [10, 11, 12] or as independent engines [13]. This practice has limited sustainability: forks are often not merged into the main software distribution and hence become abandoned; independent implementations require a considerable upfront cost and also risk abandonment more than established engines. Comparability is also limited: forks based on older versions of an engine cannot meaningfully be evaluated against newer forks, and evaluating combinations of cross-implementation features—such as different algorithms on different interfaces—is not possible without code adaptation. As a result, many interesting comparisons are never performed because they are too costly to implement and maintain. For example, it is currently unknown how the Linked Data Eddies algorithm [13] performs over a federation [8] of brTPF interfaces [14]. Another example is that the effects of various optimizations and extensions for TPF interfaces [10, 11, 12, 13, 14, 15, 16, 17] have only been evaluated in isolation, whereas certain combinations will likely prove complementary.

In order to handle the increasing heterogeneity of Linked Data on the Web, as well as various solutions for querying it, there is a need for a flexible and modular query engine to experiment with all of these techniques—both separately and in combination. In this article, we introduce Comunica to realize this vision. It is a highly modular meta engine for federated SPARQL query evaluation over heterogeneous interfaces, including TPF interfaces, SPARQL endpoints, and data dumps. Comunica aims to serve as a flexible research platform for designing, implementing, and evaluating new and existing Linked Data querying and publication techniques.

Comunica differs from existing query processors on different levels:

  1. The modularity of the Comunica meta query engine allows for extensions and customization of algorithms and functionality. Users can build and fine-tune a concrete engine by wiring the required modules through an RDF configuration document. By publishing this document, experiments can be repeated and adapted by others.
  2. Within Comunica, multiple heterogeneous interfaces are first-class citizens. This enables federated querying over heterogeneous sources and makes it possible, for example, to evaluate queries over any combination of SPARQL endpoints, TPF interfaces, data dumps, or other types of interfaces.
  3. Comunica is implemented using Web-based technologies in JavaScript, which enables usage through browsers, the command line, the SPARQL protocol [7], or any Web or JavaScript application.

Comunica and its default modules are publicly available on GitHub and the npm package manager under the open-source MIT license (canonical citation: https://zenodo.org/record/1202509).

Some time ago, we released our initial resource paper about Comunica [18], in which we introduced the framework. In this paper, we cover the architecture more extensively to provide a clearer picture of the framework’s internals. Besides that, we also cover many of the additional tools that have been built around Comunica. These include both tools that make life easier for developers working on Comunica, and a showcase of other technologies that already use Comunica to query Linked Data.

{.todo} Update the structure section below based on final structure.

This article is structured as follows. In the next section, we discuss the related work, followed by the main features of Comunica in . After that, we introduce the architecture of Comunica in Section 2, and its implementation in Section 3. Next, we compare the performance of different Comunica configurations with the TPF Client in . Finally, Section 7 concludes and discusses future work.

Architecture

In this section, we discuss the design and architecture of the Comunica meta engine, and show how it conforms to the modularity feature requirement. In summary, Comunica is a collection of small modules that, when wired together, are able to perform a certain task, such as evaluating SPARQL queries. We first discuss the customizability of Comunica at design-time, followed by the flexibility of Comunica at run-time. Finally, we give an overview of all modules.

Customizable Wiring at Design-time through Dependency Injection

There is no such thing as the Comunica engine; instead, Comunica is a meta engine that can be instantiated into different engines based on different configurations. Comunica achieves this customizability at design-time using the concept of dependency injection [19]. Using a configuration file, which is created before an engine is started, components for an engine can be selected, configured, and combined. For this, we use the Components.js [20] JavaScript dependency injection framework. This framework is based on semantic module descriptions and configuration files using the Object-Oriented Components ontology [21].

Description of Individual Software Components

In order to refer to Comunica components from within configuration files, we semantically describe all Comunica components using the Components.js framework in JSON-LD [22]. Listing 1 shows an example of the semantic description of an RDF parser.

Description of Complex Software Configurations

A specific instance of a Comunica engine can be initialized using Components.js configuration files that describe the wiring between components. For example, Listing 2 shows a configuration file of an engine that is able to parse N3 and JSON-LD-based documents. This example shows that, due to its high degree of modularity, Comunica can be used for other purposes than a query engine, such as building a custom RDF parser.

Since many different configurations can be created, it is important to know which one was used for a specific use case or evaluation. For that purpose, the RDF documents that are used to instantiate a Comunica engine can be published as Linked Data [21]. They can then serve as provenance and as the basis for derived set-ups or evaluations.

{
  "@context": [ ... ],
  "@id": "npmd:@comunica/actor-rdf-parse-n3",
  "components": [
    {
      "@id":            "crpn3:Actor/RdfParse/N3",
      "@type":          "Class",
      "extends":        "cbrp:Actor/RdfParse",
      "requireElement": "ActorRdfParseN3",
      "comment":        "An actor that parses Turtle-like RDF",
      "parameters": [
        {
          "@id": "caam:Actor/AbstractMediaTypedFixed/mediaType",
          "default": [ "text/turtle", "application/n-triples" ]
        }
      ]
    }
  ]
}

Listing 1: Semantic description of a component that is able to parse N3-based RDF serializations. This component has a single parameter that allows media types to be registered that this parser is able to handle. In this case, the component declares two default media types.

{
  "@context": [ ... ],
  "@id": "http://example.org/myrdfparser",
  "@type": "Runner",
  "actors": [
    { "@type": "ActorInitRdfParse",
      "mediatorRdfParse": {
        "@type": "MediatorRace",
        "cc:Mediator/bus": { "@id": "cbrp:Bus/RdfParse" }
      } },
    { "@type": "ActorRdfParseN3",
      "cc:Actor/bus": "cbrp:Actor/RdfParse" },
    { "@type": "ActorRdfParseJsonLd",
      "cc:Actor/bus": "cbrp:Actor/RdfParse" }
  ]
}

Listing 2: Comunica configuration of ActorInitRdfParse for parsing an RDF document in an unknown serialization. This actor is linked to a mediator with a bus containing two RDF parsers for specific serializations.

Flexibility at Run-time using the Actor–Mediator–Bus Pattern

Once a Comunica engine has been configured and initialized, components can interact with each other in a flexible way using the actor [23], mediator [24], and publish–subscribe [25] patterns. Any number of actor, mediator, and bus modules can be created, where each actor interacts with mediators, which in turn invoke other actors that are registered to a certain bus.

Fig. 1 shows an example logic flow between actors through a mediator and a bus. The relation between these components, their phases, and their chaining is explained hereafter.

[actor-mediator-bus pattern]

Fig. 1: Example logic flow where Actor 0 requires an action to be performed. This is done by sending the action to the Mediator, which sends a test action to Actors 1, 2 and 3 via the Bus. The Bus then sends all test replies to the Mediator, which chooses the best actor for the action, in this case Actor 3. Finally, the Mediator sends the original action to Actor 3, and returns its response to Actor 0.

Relation between Actors and Buses

Actors are the main computational units in Comunica: they are responsible for all the tasks and computations that need to be done. These can range from simple jobs, like executing an HTTP request, to more complex functions, such as solving a SPARQL query. The main idea is that more complex actors can delegate some of their work to simpler actors, thereby reducing the required implementation effort and increasing the reuse of existing code.

Buses and mediators form the glue that ties the actors together and lets them interact. Every actor subscribes to one or more buses, each of which contains a collection of actors. Actors are responsible for accepting certain messages via their bus(es), and for responding with an answer. Fig. 2 shows an example of how actors can be registered to buses.

Initially, we considered having a single bus to which all messages would be sent. Every actor would then have to check every message to see whether it applied to it. This would have caused drastic overhead, however, forcing many unnecessary checks by actors for unrelated messages.

To avoid this issue, we created multiple buses on which actors are grouped by functionality. This greatly reduces the number of unneeded message checks. For example, all actors that transform bindings are grouped together, as are all actors that handle metadata, and so on. Note that these are the groupings used in the currently existing implementations built on Comunica; nothing stops a developer from creating a configuration that uses only a single bus.

[relation between actors and buses]

Fig. 2: An example of two different buses each having two subscribed actors. The left bus has different actors for parsing triples in a certain RDF serialization to triple objects. The right bus has actors that join query bindings streams together in a certain way.

Mediators handle Actor Run and Test Phases

Each mediator is connected to a single bus, and its goal is to determine and invoke the best actor for a certain task. The definition of ‘best’ depends on the mediator, and different implementations can lead to different choices in different scenarios. A mediator works in two phases: the test phase and the run phase. The test phase is used to check under which conditions the action can be performed by each actor on the bus. It must always precede the run phase, and is used to select which actor is best suited to perform the task under the given conditions. Once such an actor is determined, the run phase of that single actor is initiated. The run phase takes the same type of message, effectively acts on it, and returns the result of the action. Fig. 3 shows an example of a mediator invoking a test and run phase.

[mediators handle actor run and test phases]

Fig. 3: Example sequence diagram of a mediator that chooses the fastest actor on a parse bus with two subscribed actors. The first parser is very fast but requires a lot of memory, while the second parser is slower but requires less memory. Which one is best depends on the use case and is determined by the mediator. The mediator first tests the actors for the action, and then runs the action using the best actor.

It is up to the actors to provide a correct implementation of both the test and run functions. The run function is straightforward: it is the implementation of the actor’s functionality. The test function, on the other hand, can be harder. It consists of two steps: first, the actor has to determine whether it can act on the given input at all, since even though similar actors are grouped together, some of them may be more specific and unable to handle every kind of input. Second, it has to report how much effort executing this actor would require. Since it is not always possible to predict the cost of executing the run function, the test function will often rely on estimates. Nothing stops an actor from misrepresenting its functionality or providing incorrect estimates, but once discovered, such actors should simply be removed from the configuration.
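The interplay between buses, mediators, and the actors’ test and run functions can be sketched in a few lines of JavaScript. All class, function, and actor names below are hypothetical simplifications for illustration, not the actual Comunica API:

```javascript
// A bus holds a collection of subscribed actors and can collect
// their test replies for a given action.
class Bus {
  constructor() { this.actors = []; }
  subscribe(actor) { this.actors.push(actor); }
  test(action) {
    return this.actors.map(actor => ({ actor, estimate: actor.test(action) }));
  }
}

// A mediator that selects the actor reporting the lowest cost estimate,
// then initiates that actor's run phase.
class MinCostMediator {
  constructor(bus) { this.bus = bus; }
  mediate(action) {
    const replies = this.bus.test(action).filter(r => r.estimate !== null);
    if (replies.length === 0) throw new Error('No actor can handle this action');
    replies.sort((a, b) => a.estimate - b.estimate);
    return replies[0].actor.run(action);
  }
}

// Two toy parser actors: test() returns null when the actor cannot
// handle the input, or a cost estimate when it can.
const fastTurtleParser = {
  test: action => (action.mediaType === 'text/turtle' ? 1 : null),
  run: action => `parsed ${action.mediaType} quickly`,
};
const genericParser = {
  test: () => 10, // handles any media type, but slowly
  run: action => `parsed ${action.mediaType} slowly`,
};

const bus = new Bus();
bus.subscribe(fastTurtleParser);
bus.subscribe(genericParser);
const mediator = new MinCostMediator(bus);
```

In this sketch, `mediator.mediate({ mediaType: 'text/turtle' })` selects the fast Turtle parser, while any other media type falls through to the generic one.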

Comunica does not enforce any structure other than the actor model described above; how this model is implemented depends on the developer. In the actors we provide, we split up functionality as much as possible over multiple actors, thereby increasing actor reuse and focusing each actor on its own separate task. At the other end of the spectrum, it is possible to build an application with a single actor that does all the work, although this loses the advantages of the actor model. With the actors we provide, we try to guide developers towards creating irreducible actors, but in the end there is flexibility.

Executing a SPARQL query

Although Comunica is not a single engine, we do provide some actors with a preset configuration, allowing users to immediately make use of its features. The main actor there is actor-init-sparql, which provides a configuration of actors allowing users to execute SPARQL queries over multiple heterogeneous sources.

Fig. 4 shows how an input query is handled by that Comunica engine. First, the query is parsed and converted to SPARQL algebra, which is how queries are represented internally. That algebra is then sent to the Query Operation bus: a collection of actors that each handle one specific algebra operation and recursively call the same bus to solve the remaining algebra. Once a quad pattern is reached, it is sent to the Quad Pattern Resolver bus, where different actions are taken depending on the sources. If there are multiple sources, they all get queried for results by recursively calling the same bus with the separate sources. In the case of a TPF source, the paginated data is streamed and the metadata extracted to eventually produce the correct bindings, while in the case of a SPARQL endpoint, a new query is composed to retrieve the corresponding results. These results then travel back up the stack of actors to be serialized in one of the available formats.

[sparql actor diagram]

Fig. 4: An overview of the flow through the actors when executing a SPARQL query with Comunica.
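The recursive dispatch over algebra operations can be illustrated with a toy evaluator. The operation handlers below are drastically simplified stand-ins for Comunica’s query-operation actors, and the in-memory resolver plays the role of the Quad Pattern Resolver bus; all names and data are hypothetical:

```javascript
// Dispatch an algebra operation to its handler; sub-operations re-enter
// the same dispatch, mirroring actors recursively calling their bus.
function evaluate(op, resolveQuadPattern) {
  switch (op.type) {
    case 'project': // keep only the projected variables in each binding
      return evaluate(op.input, resolveQuadPattern).map(binding =>
        Object.fromEntries(op.variables.map(v => [v, binding[v]])));
    case 'union': // concatenate the bindings of both branches
      return [...evaluate(op.left, resolveQuadPattern),
              ...evaluate(op.right, resolveQuadPattern)];
    case 'pattern': // leaf: delegate to the quad-pattern resolver
      return resolveQuadPattern(op);
    default:
      throw new Error(`No actor for operation: ${op.type}`);
  }
}

// A stand-in resolver that matches a pattern against an in-memory source,
// playing the role of a single-source quad pattern actor.
const data = [
  { s: 'alice', p: 'knows', o: 'bob' },
  { s: 'alice', p: 'knows', o: 'carol' },
];
const resolve = pattern =>
  data.filter(q => q.p === pattern.predicate)
      .map(q => ({ [pattern.object]: q.o }));

const query = {
  type: 'project',
  variables: ['?friend'],
  input: { type: 'pattern', subject: 'alice', predicate: 'knows', object: '?friend' },
};
const results = evaluate(query, resolve);
```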

Modules

At the time of writing, Comunica consists of 118 different modules: 17 buses, 5 mediator types, 83 actors, and 13 other modules. In this section, we discuss only the most important actors and their interactions.

The main bus in Comunica is the query operation bus, which consists of 34 different actors that provide at least one possible implementation of the typical SPARQL operations, such as quad patterns, basic graph patterns (BGPs), unions, and projections. These actors interact with each other using streams of quads or solution mappings, and act on a query plan expressed in SPARQL algebra [1].

In order to enable heterogeneous sources to be queried in a federated way, we allow a list of sources, annotated by type, to be passed when a query is initiated. These sources are passed down through the chain of query operation actors until the quad pattern level is reached. At this level, different actors exist for handling a single source of a certain type, such as TPF interfaces, SPARQL endpoints, and local or remote data dumps. In the case of multiple sources, one actor exists that implements a federation algorithm defined for TPF [8]; instead of federating over different TPF interfaces, however, it federates over different single-source quad pattern actors.
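The idea of federating over single-source quad pattern actors rather than over the interfaces directly can be sketched as follows. All names and data are hypothetical simplifications; a real federation algorithm would work on streams and use metadata for planning:

```javascript
// Build a single-source actor: given the quads of one source, it
// resolves a quad pattern against that source only.
const makeSourceActor = quads => pattern =>
  quads.filter(q => pattern.subject === '?' || q.subject === pattern.subject);

// The federation actor never talks to interfaces itself: it invokes
// the single-source actors and merges their results.
const federate = (sourceActors, pattern) =>
  sourceActors.flatMap(actor => actor(pattern));

// One actor could wrap a TPF interface, another a SPARQL endpoint;
// here both are backed by toy in-memory data.
const tpfActor = makeSourceActor([
  { subject: 'alice', predicate: 'knows', object: 'bob' },
]);
const sparqlActor = makeSourceActor([
  { subject: 'alice', predicate: 'knows', object: 'carol' },
  { subject: 'dave', predicate: 'knows', object: 'erin' },
]);

const merged = federate([tpfActor, sparqlActor],
  { subject: 'alice', predicate: 'knows', object: '?o' });
// merged contains the two 'alice knows …' quads, one from each source
```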

At the end of the pipeline, different actors are available for serializing the results of a query in different ways. For instance, there are actors for serializing the results according to the SPARQL JSON [26] and XML [27] result specifications, but actors with more visual and developer-friendly formats are available as well.

SPARQL Algebra

As mentioned before, Comunica internally converts SPARQL queries to SPARQL algebra. For this, we make use of two libraries: one that parses the query string, and another that converts the parsed format to a JSON representation of SPARQL algebra. The representation is made to be as close to the functions described in the specification as possible.

Using SPARQL algebra allows us to create actors that focus on the core actions required to solve SPARQL queries, without being too dependent on how the query was actually written. For every algebra operation, we have (at least) one actor that solves that specific operation. Working on the algebra also makes it easier to restructure and optimize the query before actual execution begins.
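As an illustration, a query such as `SELECT ?name WHERE { ?person foaf:name ?name }` roughly corresponds to the following algebra tree. This is simplified for illustration; the actual JSON representation produced by the conversion library includes additional details, such as the types of the individual terms:

```json
{
  "type": "project",
  "variables": [ "?name" ],
  "input": {
    "type": "bgp",
    "patterns": [
      {
        "type": "pattern",
        "subject": "?person",
        "predicate": "http://xmlns.com/foaf/0.1/name",
        "object": "?name"
      }
    ]
  }
}
```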

Implementation

Comunica is implemented in TypeScript/JavaScript as a collection of Node modules, which are able to run in Web browsers using native Web technologies. Comunica is available under an open license on GitHub and on the npm package manager. The Comunica modules are tested thoroughly, with more than 1,200 unit tests reaching a test coverage of 100%. In order to be compatible with existing JavaScript RDF libraries, Comunica follows the JavaScript API specification by the RDFJS community group, and will actively be further aligned within this community. In order to encourage collaboration within the community, we extensively use the GitHub issue tracker for planned features, bugs, and other issues. Finally, we publish detailed documentation for the usage and development of Comunica.

We provide a default Linked Data-based configuration file with all available actors for evaluating federated SPARQL queries over heterogeneous sources. This allows SPARQL queries to be evaluated using a command-line tool, from a Web service implementing the SPARQL protocol [7], within a JavaScript application, or within the browser. We fully implemented SPARQL 1.0 [28] and a subset of SPARQL 1.1 [1] at the time of writing. In future work, we intend to implement additional actors for supporting SPARQL 1.1 completely.

Comunica currently supports querying over the following types of heterogeneous data sources and interfaces: TPF interfaces, SPARQL endpoints, and local and remote data dumps.

In order to demonstrate Comunica’s ability to perform federated query evaluation over heterogeneous sources, a guide is available that shows how you can try this out in Comunica yourself.

Support for new algorithms, query operators and interfaces can be implemented in an external module, without having to create a custom fork of the engine. The module can then be plugged into existing or new engines that are identified by RDF configuration files.

In the future, we will also look into adding support for other interfaces such as brTPF [14] for more efficient join operations and VTPF [15] for queries over versioned datasets.

In-use

To further help people get started with Comunica, we have given tutorials at multiple conferences that help with the first hurdles of getting everything working and making use of it. These tutorials can all be found online and give a solid introduction to getting started with Comunica. They can be used alongside the existing documentation to fully grasp what can be done with the system and how it can be extended.

The first tutorial covers all the different ways in which the existing actors and configurations can be used to execute SPARQL queries over new and existing data sources. These range from executing queries with a command-line tool, to embedding Comunica in the code of your own project, to setting up a Web service that other users can use to execute queries.

The second tutorial explains all the steps necessary to create a new actor that can be embedded into the existing configurations. This includes an overview of all the configuration files that are required and what each of them has to contain. Besides that it also covers the helper classes that can be used and how an existing configuration can be adapted to add the new actor.

LDflex is a language that allows developers to easily access RDF data in JavaScript. This makes RDF data much more accessible to front-end developers and others who might not be used to working with SPARQL queries but still want to make use of Linked Data, thereby making it easier to spread the use of Semantic Web technologies. It is used by the Solid client, among others, and relies on Comunica to support queries. Additionally, a Comunica actor has already been created to support making HTTP requests using Solid authentication, allowing users to query their private Solid data through Comunica.

Sparqlee is a JavaScript library to evaluate SPARQL expressions. This library was built for Comunica to support these expressions, but due to the modular nature of Comunica, it can also fully stand on its own as a separate library and could even be used by other SPARQL libraries to handle their expressions.
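To illustrate the kind of task such a library performs, the following toy sketch evaluates a simple arithmetic filter expression against a solution mapping. This is an illustrative simplification, not Sparqlee’s actual API or data model:

```javascript
// Expressions are trees; evaluation recursively resolves operators and
// substitutes variables from the current solution mapping.
function evalExpr(expr, bindings) {
  switch (expr.op) {
    case 'variable': return bindings[expr.name];
    case 'literal':  return expr.value;
    case '+': return evalExpr(expr.left, bindings) + evalExpr(expr.right, bindings);
    case '>': return evalExpr(expr.left, bindings) > evalExpr(expr.right, bindings);
    default: throw new Error(`Unsupported operator: ${expr.op}`);
  }
}

// FILTER(?age + 2 > 18), evaluated against one solution mapping
const filterExpr = {
  op: '>',
  left: {
    op: '+',
    left:  { op: 'variable', name: '?age' },
    right: { op: 'literal', value: 2 },
  },
  right: { op: 'literal', value: 18 },
};
const passes = evalExpr(filterExpr, { '?age': 17 });
```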

At the time of writing, the Comunica repository has been starred 89 times and forked 12 times. The core Comunica npm package, which is required for all Comunica variants, currently has 1,083 weekly downloads. All Comunica packages can also be found separately on npm; currently, 128 packages are available.

Tooling

There are already several tools available to further extend what can be done with Comunica or to help in its usage.

jQuery Widget

The jQuery Widget creates a browser-based GUI for the Comunica SPARQL client, allowing users to execute SPARQL queries in their browser without having to install anything. While it originally made use of the TPF client, and thus had limited support for heterogeneous sources, it now uses Comunica in the backend, providing all of its advantages. Besides support for many different kinds of interfaces, it also supports executing GraphQL queries over those sources. Changing the Comunica configuration that is used is simply a matter of changing the configuration file in the repository, allowing anyone to quickly set up an online query engine with their own implementation.

Bencher

Although evaluations are much easier with Comunica thanks to the easy swapping of modules, some work still needs to happen before they can be run, such as setting up the server, adding the test data, and preparing test queries. Comunica Bencher helps users skip several of those steps. It can automatically generate everything necessary to set up a TPF server, and create a WatDiv [41] dataset with corresponding queries. It can then also set up a client to run those queries against the server and generate graphs from the results. Although not every kind of setup can be covered this way, it is already very useful for many use cases. Specifics such as the number of dry runs, caching, data size, etc. can all be tweaked in the configuration, allowing users to fine-tune the tool to their needs. Finally, this way the tests can also easily be shared for later reuse.

Generator

Creating a Comunica actor can be a complex endeavour for developers without experience in the system: several configuration files have to be created, these have to be mapped to the code, and the code has to follow certain rules to ensure it integrates with the rest of Comunica. That is why we created a Comunica generator, which creates an empty actor with all the necessary files already in place, filled with default values. It creates this initial project by asking the user several questions and then automatically generates the corresponding TypeScript and JSON-LD files based on the answers. These still need to be changed, especially if the user has many extra requirements for their actor, but it greatly reduces the start-up time when coding and reduces the chance of errors or missing files.

Conclusions

In this work, we introduced Comunica as a highly modular meta engine for federated SPARQL query evaluation over heterogeneous interfaces. Comunica is thereby the first system that accomplishes the Linked Data Fragments vision of a client that is able to query over heterogeneous interfaces. Not only can Comunica be used as a client-side SPARQL engine, it can also be customized into a more lightweight engine for more specific tasks, such as only evaluating BGPs over Turtle files or evaluating the efficiency of different join operators, or it can even serve as a complete server-side SPARQL query endpoint that aggregates different data sources. In future work, we will look into supporting alternative (non-semantic) query languages as well, such as GraphQL [42].

If you are a Web researcher, then Comunica is the ideal research platform for investigating new Linked Data publication interfaces and for experimenting with different query algorithms. New modules can be implemented independently without having to fork existing codebases. The modules can be combined with each other using an RDF-based configuration file that can be instantiated into an actual engine through dependency injection. However, the target audience is broader than just the research community. As Comunica is built on Linked Data and Web technologies, is extensively documented, and has a ready-to-use API, developers of RDF-consuming (Web) applications can also make use of the platform. In the future, we will continue maintaining and developing Comunica and intend to support and collaborate with future researchers on this platform.

The introduction of Comunica will trigger a new generation of Web querying research. Due to its flexibility and modularity, existing areas can be combined and evaluated in more detail, and promising new areas that have remained unexplored so far will be exposed.