Abstract
Apache Hive is a widely used data warehousing and analysis tool. Developers write SQL like HIVE queries, which are converted into MapReduce programs to runs on a cluster. Despite its popularity, there is little research on performance comparison and diagnose. Part of the reason is that instrumentation techniques used to monitor execution can not be applied to intermediate MapReduce code generated from Hive query. Because the generated MapReduce code is hidden from developers, run time logs are the only places a developer can get a glimpse of the actual execution. Having an automatic tool to extract information and to generate report from logs is essential to understand the query execution behavior. We designed a tool to build the execution profile of individual Hive queries by extracting information from HIVE and Hadoop logs. The profile consists of detailed information about MapReduce jobs, tasks and attempts belonging to a query. It is stored as a JSON document in MongoDB and can be retrieved to generate reports in charts or tables. We have run several experiments on AWS with TPC-H data sets and queries to demonstrate that our profiling tool is able to assist developers in comparing HIVE queries written in different formats, running on different data sets and configured with different parameters. It is also able to compare tasks/attempts within the same job to diagnose performance issues.