2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)

Abstract

We have built a tool for inspecting and managing data lakes. The motivations for creating this tool are 1) schema discovery (determining which links are pertinent to solving a data analysis problem), 2) discovering high-risk links in data schemas that give rise to information security problems, and 3) discovering high-value relationships that enable data asset curation. The tool works by extracting metadata from the Hive database on a shared-tenancy instance of Hadoop, which contained a multi-terabyte real-world data asset. We use this metadata to calculate a graph of the relationships between the entities based on column matching. This allows us to apply Social Network Analysis (SNA) techniques to discover meaningful properties of the accumulated data, for example to extract previously unknown relationships between data entities. We also present the challenges and an agenda for future research.
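
The column-matching step can be illustrated with a minimal sketch. It is not the tool's implementation: the table and column names are invented, the metadata is hard-coded rather than read from Hive, and "matching" is assumed here to mean exact equality of column names. Degree centrality stands in for the SNA measures, which the abstract does not name.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical metadata: table name -> set of column names.
# A real run would read this from the Hive metastore; these names are invented.
TABLES = {
    "customers": {"customer_id", "name", "email"},
    "orders": {"order_id", "customer_id", "product_id"},
    "products": {"product_id", "title"},
    "audit_log": {"event_id", "timestamp"},
}


def build_graph(tables):
    """Link two tables when they share at least one column name (assumed rule)."""
    edges = {}
    for a, b in combinations(sorted(tables), 2):
        shared = tables[a] & tables[b]
        if shared:
            edges[(a, b)] = shared
    return edges


def degree_centrality(tables, edges):
    """Fraction of the other tables that each table is linked to."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(tables)
    return {t: (degree[t] / (n - 1) if n > 1 else 0.0) for t in tables}


edges = build_graph(TABLES)
centrality = degree_centrality(TABLES, edges)

# Print tables from most to least connected.
for table, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{table}: {score:.2f}")
```

In this toy data, "orders" links to both "customers" and "products" and so scores highest, while "audit_log" shares no columns and is isolated. Under the stated assumptions, a highly connected table is one candidate for the high-value or high-risk entities the abstract describes.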