Abstract
Sequencing the DNA of the estimated 7.5 billion living humans would generate 1.4 zettabytes of data. Yet with current per-read rendering techniques, even a single DNA alignment file of around 200 gigabytes can be resource-intensive to visualize at arbitrary scale. Going from human DNA and RNA sequencing data to biological insight requires domain knowledge as well as computational methods that are bound by time and space. We address these limitations by integrating a parallel out-of-core feature extraction algorithm with a disk-based hierarchical data store that provides a speed-up of several orders of magnitude for common analysis and visualization tasks. To demonstrate the effectiveness of our strategy, we have developed a web-based REST service that serves translated data to a real-time genomic viewer, which in turn renders standardized moments as stacked-area graphs of features in milliseconds for multiple samples using a familiar genome browser interface. Unlike per-read techniques, which read a variable number of rows from the sequence alignment file depending on the region of interest, our data structure returns a controllable amount of data for that region, making the technique ideally suited for visualization and macro-level insight into large cohorts. The strategy works well for high-coverage single coordinate-based visualization but could be extended for use in other long-range visualization techniques. We detail our open-source Cython/Python-based implementation as well as our prototype web-based visualization tool, and then compare the resulting performance against established visualization tools.