Sixth IEEE International Conference on Data Mining (ICDM'06)
pp. 159-170
Rapid Identification of Column Heterogeneity
Bing Tian Dai, National Univ. of Singapore, Singapore
Nick Koudas, University of Toronto
Beng Chin Ooi, National Univ. of Singapore, Singapore
Divesh Srivastava, AT&T Labs-Research
Suresh Venkatasubramanian, AT&T Labs-Research

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2006.132
Abstract
Data quality is a serious concern in every data management
application, and a variety of quality measures have
been proposed, e.g., accuracy, freshness and completeness,
to capture common sources of data quality degradation. We
identify and focus attention on a novel measure, column heterogeneity,
that seeks to quantify the data quality problems
that can arise when merging data from different sources.
We identify desiderata that a column heterogeneity measure
should intuitively satisfy, and describe our technique
for quantifying database column heterogeneity, which uses
a novel combination of cluster entropy and soft clustering.
Finally, we present detailed experimental results, using diverse
data sets of different types, to demonstrate that our
approach provides a robust mechanism for identifying and
quantifying database column heterogeneity.
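
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of pairing soft clustering with cluster entropy, not the authors' method. It assumes the soft cluster memberships of the column values are already available as rows of probabilities; the function name `cluster_entropy` and the sample data are hypothetical.

```python
import math


def cluster_entropy(memberships):
    """Entropy (in bits) of the cluster marginal induced by soft assignments.

    memberships: list of rows, where row i holds p(cluster j | value i)
    and each row sums to 1. This input format is an assumption made for
    illustration only.
    """
    n = len(memberships)
    k = len(memberships[0])
    # Marginal mass of each cluster: average the soft assignments over values.
    marginal = [sum(row[j] for row in memberships) / n for j in range(k)]
    # Shannon entropy of the marginal; zero-mass clusters contribute nothing.
    return -sum(p * math.log2(p) for p in marginal if p > 0)


# Hypothetical column whose values all fall in one cluster (one data type).
homogeneous = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
# Hypothetical column split evenly between two clusters (two merged types).
mixed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

h_homogeneous = cluster_entropy(homogeneous)
h_mixed = cluster_entropy(mixed)
```

Under these assumptions, a column concentrated in a single cluster scores lower than one spread evenly across clusters, which matches the intuition that a heterogeneity score should rise as more distinct kinds of values share a column.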
Additional Information
Citation:
Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, Suresh Venkatasubramanian,
"Rapid Identification of Column Heterogeneity,"
Sixth IEEE International Conference on Data Mining (ICDM'06),
pp. 159-170, 2006.