Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid

Abstract

This paper considers the reliability of software Distributed Shared Memory systems in which the unit of sharing is a persistent read-write object. We present an extended coherence protocol for the causal consistency model that integrates replication management with independent checkpointing. The protocol uses a novel coordinated burst checkpoint operation to replicate consistent checkpoints of shared objects in the local memory of distinct system nodes. No special reliable hardware devices are required. The protocol offers high availability of shared objects with limited overhead and ensures fast recovery in case of multiple node failures. If the network partitions, all processes in a majority partition of the system can continue to access all the objects.
