2021 IEEE International Conference on Big Data (Big Data)

Abstract

Entity matching is a problem of interest in information integration and data cleaning. Because the representations of the same entity vary, it is often impossible to fully automate entity matching, and human input is required. How, then, can human resources be integrated into entity matching so that high quality is guaranteed while the cost of those resources is minimized? In this paper, we propose BUBBLE, a novel human-in-the-loop entity matching framework that hybridizes Bayesian inference and crowdsourcing. To guarantee entity matching quality, Bayesian inference is conducted to determine whether a matching requires crowdsourcing. We show that a Bayesian error rate can be defined for this problem. For optimization, we use metric learning to select candidate matching pairs by nearest-neighbor search in the learned embedding space, and we construct a k-nearest neighbor graph to avoid redundant matching. We applied BUBBLE to a bibliographic data matching problem at the National Diet Library. The experimental results show that, for the same number of task assignments to humans, BUBBLE yields higher-quality results than the compared assignment methods. The results also show that our optimization scheme is effective without sacrificing quality.
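The routing idea described above can be illustrated with a minimal sketch. It is not BUBBLE's implementation: the Gaussian likelihoods over embedding distance, the parameters `mu_match`, `mu_non`, `sigma`, `prior`, the threshold `max_error`, and the function names are all hypothetical choices made for illustration. The per-pair error used here is the standard Bayes error of the best automatic decision, which may differ from the error rate defined in the paper.

```python
import math


def posterior_match(distance, prior=0.5, mu_match=0.2, mu_non=0.8, sigma=0.2):
    """P(match | distance) under two assumed Gaussian likelihoods (hypothetical model)."""
    def gauss(x, mu):
        # Unnormalized Gaussian; the shared normalizer cancels in the posterior.
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

    weight_match = gauss(distance, mu_match) * prior
    weight_non = gauss(distance, mu_non) * (1.0 - prior)
    return weight_match / (weight_match + weight_non)


def route(distance, max_error=0.1):
    """Decide automatically when the expected error is small, otherwise ask the crowd."""
    p = posterior_match(distance)
    error = min(p, 1.0 - p)  # error probability of the best automatic decision
    if error > max_error:
        return "crowd"
    return "match" if p >= 0.5 else "non-match"


def k_nearest(query_index, vectors, k):
    """Indices of the k vectors closest to vectors[query_index] (brute force, Euclidean)."""
    query = vectors[query_index]
    others = [i for i in range(len(vectors)) if i != query_index]
    others.sort(key=lambda i: math.dist(query, vectors[i]))
    return others[:k]


# Toy embeddings standing in for a learned metric space (assumed data).
embeddings = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.2, 0.0)]
candidates = k_nearest(0, embeddings, k=2)
decisions = {
    i: route(math.dist(embeddings[0], embeddings[i])) for i in candidates
}
```

Restricting comparisons to each record's nearest neighbors keeps the number of candidate pairs linear in the number of records, and only pairs whose posterior is ambiguous are sent to human workers.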