A reliability-aware approach for an optimal checkpoint/restart model in HPC environments

Yudan Liu; Raja Nassar; Chockchai Leangsuksun; Nichamon Naksinehaboon; Mihaela Paun; Stephen Scott

doi:10.1109/CLUSTR.2007.4629264

2007 IEEE International Conference on Cluster Computing

Year: 2007, Pages: 452-457

Authors

Yudan Liu, College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA
Raja Nassar, College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA
Chockchai Leangsuksun, College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA
Nichamon Naksinehaboon, College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA
Mihaela Paun, College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA
Stephen Scott, Computer Science and Mathematics Division, Oak Ridge National Laboratory, TN 37831, USA

Abstract

The growing physical size of High Performance Computing (HPC) platforms makes system reliability more challenging. To minimize the performance loss caused by unexpected failures and by unnecessary overhead from fault-tolerance mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy that minimizes rollback and checkpoint overheads. Our scheme addresses the fault tolerance challenge, especially in large-scale HPC systems, by providing optimal checkpoint placement techniques derived from the actual system reliability. Unlike existing checkpoint models, which can handle only Poisson failures and a constant checkpoint interval, our model supports a varying checkpoint interval and can deal with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.
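
To make the contrast in the abstract concrete, the sketch below shows the kind of baseline it refers to: a constant checkpoint interval under Poisson (exponentially distributed) failures, computed with Young's classical first-order approximation. This is not the model proposed in the paper. The function names and the numeric values are illustrative assumptions. The Weibull hazard function is included only to show why a non-Poisson failure distribution, whose failure rate changes over time, motivates a varying interval.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimal constant
    checkpoint interval under Poisson failures: sqrt(2 * C * MTBF).
    Both arguments must use the same time unit."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull distribution at time t > 0.
    shape == 1 reduces to the constant rate of the exponential case."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape, and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


if __name__ == "__main__":
    # Hypothetical values: 2-minute checkpoint cost, 100-minute MTBF.
    print("constant interval (minutes):", young_interval(2.0, 100.0))

    # With shape < 1 the failure rate falls as the system ages, so a
    # single fixed interval cannot match the risk at every point in time.
    for t in (10.0, 100.0, 1000.0):
        print("t =", t, "hazard =", weibull_hazard(t, 0.5, 100.0))
```

With a constant failure rate, one interval fits the whole run. When the rate varies with time, as in the Weibull case above, the appropriate spacing between checkpoints varies as well, which is the situation the abstract says the proposed model is designed to handle.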

Related Articles

Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment
Cluster Computing and the Grid, IEEE International Symposium on
An optimal checkpoint/restart model for a large scale high performance computing system
2008 IEEE International Parallel & Distributed Processing Symposium
The Checkpoint Interval Optimization of Kernel-Level Rollback Recovery Based on the Embedded Mobile Computing System
Computer and Information Technology, IEEE 8th International Conference on
Two-level checkpoint/restart modeling for GPGPU
2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA)
Checkpoint/restart in practice: When ‘simple is better’
2014 IEEE International Conference On Cluster Computing (CLUSTER)
Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off
2005 IEEE International Conference on Cluster Computing
Optimal age-dependent checkpoint strategy with retry of rollback recovery
Proceedings 2nd International Workshop on Autonomous Decentralized System
Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments
Cluster Computing and the Grid, IEEE International Symposium on
Efficient Encoding and Reconstruction of HPC Datasets for Checkpoint/Restart
2019 35th Symposium on Mass Storage Systems and Technologies (MSST)
Checkpoint Restart Support for Heterogeneous HPC Applications
2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
