Evaluation of process level redundant checkpointing/restart for HPC systems

Ifeanyi P. Egwutuoha; David Levy; Bran Selic

doi:10.1109/PCCC.2011.6108098

IEEE International Performance Computing and Communications Conference

Year: 2011, Pages: 1-2

Authors

Ifeanyi P. Egwutuoha, Electrical & Information Engineering, The University of Sydney, NSW 2006, Australia
David Levy, Electrical & Information Engineering, The University of Sydney, NSW 2006, Australia
Bran Selic, Electrical & Information Engineering, The University of Sydney, NSW 2006, Australia

Abstract

In recent years, High Performance Computing (HPC) systems have been shifting from expensive, massively parallel custom architectures to clusters of commodity personal computers to take advantage of cost and performance benefits. To avoid having to restart an application after a sudden failure, checkpointing/restart fault tolerance mechanisms are commonly implemented. One drawback of checkpointing/restart is that it adds overhead, which increases the execution time of an application. We present a theoretical analysis of our technique, process level redundant (PLR) checkpointing/restart. The results show that PLR checkpointing/restart can significantly improve the overall reliability of an HPC system.
