2009 IEEE International Conference on Cluster Computing and Workshops

Abstract

Checkpoint/restart has been widely used in computing systems for fault tolerance, job scheduling, and system maintenance. However, a lack of transparency has hindered the adoption of many implementations. In this paper, we present a fully transparent parallel checkpoint/restart framework, DCR, which takes advantage of kernel-level checkpointing and TCP session preservation. DCR is fully transparent to application programmers and users: no source code modifications, recompilations, or system call interceptions are required. Because of the simplicity of its design and the dominance of TCP/IP in parallel applications, DCR can be readily deployed on systems of widely varying scale, from single-CPU computers to large-scale clusters. A new on-demand blocking checkpoint protocol, which makes use of TCP's reliability mechanism, is proposed to eliminate global synchronization. We demonstrate the effectiveness and efficiency of DCR with multiple MPICH2 applications running on Dawning 5000A.