Abstract
The 13th International Workshop on Data Mining in Bioinformatics (BIOKDD’14) was organized in conjunction with the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining on August 24, 2014 in New York, USA. It brought together international researchers in the interacting disciplines of data mining, systems biology, and bioinformatics at the Bloomberg Headquarters venue. The goal of the workshop is to encourage Knowledge Discovery and Data Mining (KDD) researchers to take on the numerous challenges that bioinformatics offers. This year, the workshop featured the theme “Knowledge discovery using big data in biological/biomedical systems”.
In the last few years, the rapid development of high-throughput technologies has led to the accumulation of large amounts of data from different areas of molecular and cellular biology. These developments, together with growing interest in the community in gaining a systems-wide understanding of the cellular machinery, have provided unprecedented insights into the structure, organization, and dynamics of major cellular processes such as transcription, translation, degradation, replication, and metabolism. Likewise, efforts to understand the interaction of the cell with its external environment have generated global phenotypic maps such as those due to small-molecule perturbations and human microbiomes, which provide us with unparalleled information on the wide variety of microbes that interact with the host’s tissues and play an important role in the health and disease of an individual. Despite the growing amount of data representing each of these processes, it must be acknowledged that none of them works in isolation; rather, they form an integrated network of different wiring diagrams, which is responsible for the observed behavior of the cell within the context of its environment. There is mounting evidence from several recent studies that the network of associations underlying each cellular process can be studied in detail to provide meaningful insights into how it contributes to the functioning of the cell, and to identify the factors that constrain its structure and how it influences the genome on which it is encoded. Nevertheless, an open challenge of contemporary biology is to integrate these diverse cellular programs, first to understand and model in quantitative terms the topological and dynamic properties of such a unified cellular network, and then to exploit them for the therapeutic benefit of mankind.
This field of integrative systems biology, generating large and disparate kinds of biological datasets, is full of opportunities for applying computational and statistical approaches, especially from data mining and machine learning. The goal here is to build accurate predictive or descriptive models of biological processes and diseases, and to integrate data and knowledge bases from diverse sources to provide experimentally testable hypotheses. These approaches have already revolutionized new-age biology by enabling novel discoveries from basic biology to complex disease contexts, as well as in the development of therapeutics. Data mining will continue to play an essential role in understanding these fundamental problems and in the development of novel therapeutic/diagnostic solutions in post-genomic medicine.
Papers for this special section were selected from the BIOKDD’14 workshop. To meet the acceptance criteria for the IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), each of the papers selected for presentation at the workshop underwent additional reviews by at least two reviewers managed by the TCBB editors. We are very grateful to the anonymous reviewers for helping us select the following papers for this special section.
The first paper, “Divide and Conquer Approach to Contact Map Overlap Problem using 2D-Pattern Mining of Protein Contact Networks”, by Suvarna Vani Koneru and Durga Bhavani S, presents an interesting divide-and-conquer approach for optimizing the running time of the contact map overlap (CMO) problem. Protein structure prediction and comparison are perhaps among the most challenging problems in bioinformatics. Structure comparison is typically modeled as a CMO problem, in which the similarity of the two proteins being compared is measured by the amount of overlap between their corresponding protein contact maps. A protein contact map is a two-dimensional representation of the protein 3D structure. In this study, the authors propose a novel approach to the CMO problem, which involves finding matching regions between the two contact maps using an approximate 2D-pattern matching algorithm and a dynamic programming technique. These matched pairs of small contact maps are then submitted to a fast heuristic CMO algorithm. The approach facilitates parallelization at this level, since all the pairs of contact maps can be submitted to the algorithm in parallel. A merge algorithm is then used to obtain the overall alignment. The authors show that, in addition to achieving a better running time, this algorithm can also obtain better overlap for certain protein folds.
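As a rough illustration of what the overlap between two contact maps measures, the following minimal sketch builds contact maps from 3D coordinates and counts the contacts shared under a given alignment. It is not the algorithm of the paper: the coordinates, the distance threshold, and the function names `contact_map` and `overlap` are hypothetical, and the sketch only scores a fixed alignment rather than searching for the best one.

```python
import math
from itertools import combinations


def contact_map(coords, threshold):
    """Return the set of residue pairs (i, j), i < j, closer than `threshold`."""
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if math.dist(coords[i], coords[j]) < threshold
    }


def overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A whose aligned residue pair is also a contact in B.

    `alignment` maps residue indices of A to residue indices of B and is
    assumed to be order-preserving, so aligned pairs keep the i < j order.
    """
    count = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            if (alignment[i], alignment[j]) in contacts_b:
                count += 1
    return count


# Hypothetical toy "proteins": one 3D point per residue (assumed data).
coords_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0)]
coords_b = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (3, 1, 0)]

# The 1.5 threshold is an arbitrary choice for this toy example.
contacts_a = contact_map(coords_a, threshold=1.5)
contacts_b = contact_map(coords_b, threshold=1.5)

identity = {0: 0, 1: 1, 2: 2, 3: 3}   # align residue i of A with residue i of B
shifted = {0: 1, 1: 2, 2: 3, 3: 4}    # align residue i of A with residue i + 1 of B

print(overlap(contacts_a, contacts_b, identity))
print(overlap(contacts_a, contacts_b, shifted))
```

In this toy case the shifted alignment preserves more contacts than the identity alignment, which is why the CMO problem is posed as a search over alignments for the one with maximum overlap.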
The second paper, “Biclustering with Flexible Plaid Models to Unravel Interactions between Biological Processes” by Rui Henriques and Sara C. Madeira, presents an improved biclustering approach that can identify functional modules of associated or interacting genes. Biclusters are subspaces in which a subset of rows exhibits a correlated pattern over a subset of columns; in expression data, they correspond to subsets of genes with coherent behavior across subsets of conditions. Genes can participate in multiple biological processes at a time, so their expression can be seen as a composition of the contributions from the active processes. The plaid assumption captures this by considering the cumulative influence of the contributions from genes involved in more than one biological process (bicluster) active under a specific condition: where the rows and columns of biclusters overlap, their contributions are composed additively. Biclustering with a plaid assumption thus allows interactions between transcriptional modules to be modeled based on overlapping activity levels. The authors propose BiP (Biclustering using Plaid models), a biclustering algorithm with relaxations that allow expression levels to change in overlapping areas according to biologically meaningful assumptions. Such plaid models are biologically significant and can unravel meaningful and non-trivial functional interactions between biological processes associated with the putative regulatory modules.
The third paper, “Unsupervised Structure Detection in Biomedical Data” by Julia E. Vogt, presents an intuitive method, Ranked Neighborhood Comparison (RaNC), that detects structure in data without supervision. The method is based on ordering objects in terms of similarity and on the mutual overlap of nearest neighbors, and takes its name from this ranking of nearest neighbors. One interesting aspect of the method is that it does not divide the data into strictly separated groups, but provides a network structure in which the links between all objects are preserved. In biomedical data in particular, objects are very likely to belong to more than one group; for example, a gene with more than one function might belong to more than one group. Because the approach does not cut the structure into strictly separated groups, it is able to preserve this important information. The author also shows that the method is robust against outliers. Biomedical data sets frequently contain many outliers due to either experimental or measurement noise. Robustness to outliers is therefore a very useful feature, as there is no need to detect and remove outliers in advance in order to avoid incorrect results. Because outliers lie far from most data points, they end up as singletons in the graph and do not impair finding the underlying structure.
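The intuition that far-away outliers end up as singletons can be illustrated with a simplified stand-in for the ranked-neighborhood idea. The sketch below is not the RaNC method of the paper: it assumes a plain mutual k-nearest-neighbor rule (two objects are linked only if each ranks among the other's k closest objects), and the function name `mutual_knn_graph`, the toy points, and the choice k = 2 are hypothetical.

```python
import math


def mutual_knn_graph(points, k):
    """Link i and j only if each is among the other's k nearest neighbors."""
    n = len(points)
    neighbors = []
    for i in range(n):
        # Rank all other points by their distance to point i, closest first.
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighbors.append(set(ranked[:k]))
    # Keep a link only when the neighborhood relation holds in both directions.
    return {i: {j for j in neighbors[i] if i in neighbors[j]} for i in range(n)}


# Hypothetical toy data: two tight groups and one distant outlier.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),          # group A: indices 0-3
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),  # group B: indices 4-7
    (50.0, -50.0),                                           # outlier: index 8
]

graph = mutual_knn_graph(points, k=2)
for node, links in graph.items():
    print(node, sorted(links))
```

The outlier does have nearest neighbors of its own, but no other point ranks it among its closest objects, so it keeps no links, while the points in each group remain connected to one another.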
To conclude, we thank the authors and the reviewers for their contributions to this special section of the TCBB journal. We also thank Prof. Ying Xu and Prof. Dong Xu for their support as editors, and the editorial staff at TCBB for their assistance in making this special section possible.


