Abstract
This paper considers the reliability of software Distributed Shared Memory systems in which the unit of sharing is a persistent read-write object. We present an extended coherence protocol for the causal consistency model, which integrates replication management with independent checkpointing. It uses a novel coordinated burst checkpoint operation to replicate consistent checkpoints of shared objects in the local memory of distinct system nodes. No special reliable hardware devices are required. The protocol offers high availability of shared objects with limited overhead and ensures fast recovery in case of multiple node failures. In case of network partitioning, all the processes in a majority partition of the system can continue to access all the objects.