Scalable load balancing for MapReduce-based record linkage

Wei Yan; Yuan Xue; Bradley Malin

doi:10.1109/PCCC.2013.6742785

Abstract

Recent research has introduced load balancing schemes that are aware of the input data distribution (i.e., the data profile) to mitigate data skew and fully exploit the parallel capability of the MapReduce framework in support of record linkage. However, existing solutions face a significant scalability issue when applied to massive data sets with millions or billions of blocks (the basic unit in record linkage), because their data profiles cannot be maintained both precisely and efficiently. The goal of this paper is to introduce a profiling method based on the notion of a sketch, which provides a compact, scalable way to maintain block size statistics. In addition, we propose two load balancing algorithms that work over sketch-based profiles while solving the data skew problem associated with record linkage. We provide a theoretical analysis and extensive experiments (using Hadoop), with real and controlled synthetic data sets, to illustrate the effectiveness of our solution. The experimental results show that our load balancing algorithms can decrease the overall job completion time by 71.56% and 70.73% of the default settings in Hadoop on a set of DBLP data sets containing 2.5 to 50.4 million records.
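
To make the idea of a sketch-based block size profile concrete, the following is a minimal illustration in Python. The abstract does not say which sketch the paper uses, so a Count-Min sketch is swapped in here purely as an assumption; the class name, its parameters (width, depth), and the example blocking keys are all hypothetical. It shows only the general principle: block sizes are tracked in a fixed amount of memory regardless of how many distinct blocks exist, and sketches built by separate map tasks can be merged.

```python
import hashlib


class CountMinSketch:
    """Approximate per-block record counts in fixed memory (illustrative only)."""

    def __init__(self, width=2048, depth=4):
        # width and depth are assumed values, not taken from the paper.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Derive one hash function per row by salting with the row number.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        # Record that `count` more records fall into the block named `key`.
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # Hash collisions can only inflate a counter, so the minimum over
        # rows never underestimates the true block size.
        return min(
            self.table[row][self._index(key, row)] for row in range(self.depth)
        )

    def merge(self, other):
        # Sketches with identical dimensions combine by element-wise addition,
        # which is what lets each map task profile its own input split.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions must match")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


# Two hypothetical map tasks profile their own splits, then merge.
task_a = CountMinSketch()
task_b = CountMinSketch()
for blocking_key in ["smith"] * 5 + ["lee"] * 2:
    task_a.add(blocking_key)
for blocking_key in ["smith"] * 3:
    task_b.add(blocking_key)
task_a.merge(task_b)
```

After merging, task_a.estimate("smith") is an upper bound on the true size of that block. A load balancer could use such estimates to spot oversized blocks before assigning work to reducers, although how the paper's two algorithms do this is not described in the abstract.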
