Abstract
Extremely large (peta-scale) data collections are generally partitioned into millions of containers (disks, volumes, or files) that are essentially immovable due to their aggregate size. They are stored across a large distributed cloud of machines, with computation co-located with the data. Given this layout, even simple tasks become difficult, and naive algorithms can quickly become prohibitively expensive. We present a one-pass, communication-efficient technique for both estimating upper-order quantiles and selecting the largest k elements across a highly distributed dataset or stream. Our novel approach draws its foundations from Extreme Value Statistics (EVS) to reason about the statistical relationships between the tail distributions of dataset partitions. The tail of each partition is fitted with a Generalized Pareto Distribution (GPD), which models threshold exceedances. The fitted parameters are communicated to a central coordinator, which uses them to estimate quantiles or to solve for a threshold above which there are approximately k elements. We discuss the computational and bandwidth costs of the algorithm and demonstrate its accuracy on a variety of synthetic datasets as well as a PageRank dataset.
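To make the two stages concrete, below is a minimal Python sketch of the idea as described in the abstract. The function names, the `tail_frac` threshold rule, and the bracketing heuristic are illustrative assumptions, not the paper's exact procedure: each partition fits a GPD to its exceedances over a local threshold (here via `scipy.stats.genpareto`) and ships only a small parameter summary; the coordinator sums the fitted tail survival functions and root-finds a threshold with approximately k expected elements above it.

```python
import numpy as np
from scipy.stats import genpareto
from scipy.optimize import brentq

def fit_tail(partition, tail_frac=0.05):
    """Per-partition step: fit a GPD to exceedances over a high threshold.

    Returns the compact summary (u, n_exceed, n_total, xi, sigma) that is
    communicated to the coordinator. `tail_frac` is an assumed knob; the
    threshold-selection rule is not specified in the abstract.
    `partition` is assumed to be a 1-D NumPy array.
    """
    u = np.quantile(partition, 1.0 - tail_frac)
    exceedances = partition[partition > u] - u
    # Location is fixed to 0 because exceedances are already shifted by u.
    xi, _, sigma = genpareto.fit(exceedances, floc=0.0)
    return (u, len(exceedances), len(partition), xi, sigma)

def count_above(t, summaries):
    """Coordinator-side: expected number of elements above t, summed over
    partitions, using each partition's fitted GPD tail."""
    total = 0.0
    for u, n_exc, n_tot, xi, sigma in summaries:
        if t <= u:
            # The GPD model is only valid above the threshold; at or below
            # it, count all recorded exceedances (a crude floor).
            total += n_exc
        else:
            total += n_exc * genpareto.sf(t - u, xi, loc=0.0, scale=sigma)
    return total

def top_k_threshold(summaries, k):
    """Solve for a threshold with approximately k elements above it.

    Assumes k is smaller than the total exceedance count, so a sign
    change exists for the root finder.
    """
    lo = max(u for u, *_ in summaries)
    hi = lo + 10.0 * max(sigma for *_, sigma in summaries)
    # Expand the bracket until fewer than k expected elements lie above hi.
    while count_above(hi, summaries) > k:
        hi = lo + 2.0 * (hi - lo)
    return brentq(lambda t: count_above(t, summaries) - k, lo, hi)
```

A quantile query fits the same mold: for an upper quantile p, set k = (1 - p) * N, where N is the sum of the partition sizes in the summaries, and reuse `top_k_threshold`. Only the five-tuple per partition crosses the network, which is the source of the communication efficiency claimed above.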