Abstract
With the increasing popularity of the Semantic Web, more and more data becomes available in RDF with SPARQL as a query language. Data sets, however, can become too big to be managed and queried on a single server in a scalable way. Existing distributed RDF stores approach this problem using data partitioning, aiming at limiting the communication between servers and exploiting parallelism. This paper proposes a distributed SPARQL engine that combines a graph partitioning technique with workload-aware replication of triples across partitions, enabling efficient query execution even for complex queries from the workload. Furthermore, it discusses query optimization techniques for producing efficient execution plans for ad-hoc queries not contained in the workload.