DCR: A fully transparent checkpoint/restart framework for distributed systems

Can Ma; Zhigang Huo; Jingnan Cai; Dan Meng

doi:10.1109/CLUSTR.2009.5289172

Abstract

Checkpoint/restart has been widely used in computing systems for fault tolerance, job scheduling, and system maintenance. However, a lack of transparency has hindered the adoption of many implementations. In this paper, we present DCR, a fully transparent parallel checkpoint/restart framework that combines kernel-level checkpointing with TCP session preservation. DCR is fully transparent to application programmers and users: no source code modifications, recompilation, or system call interception is required. Because of the simplicity of its design and the dominance of TCP/IP in parallel applications, DCR can be readily deployed on systems of widely varying scale, from single-CPU computers to large-scale clusters. We propose a new on-demand blocking checkpoint protocol that uses the reliability mechanism of TCP to eliminate global synchronization. We demonstrate the effectiveness and efficiency of DCR with multiple MPICH2 applications running on Dawning 5000A.
