Abstract
High-availability clusters typically present a "single-system view," so that clients can interact with a cluster as if it were a single machine. Although beneficial to many clients, this abstraction prevents distributed applications, such as application servers, from obtaining information about the components of their deployment that run on the cluster. This paper presents a protocol and architecture on the Sun™ Cluster system for delivering cluster events to clients running outside the cluster. The XML-based communication protocol allows clients to register dynamically for cluster events and then receive them asynchronously. The underlying event delivery architecture uses the base Sun Cluster failure-recovery functionality, along with its own state recovery mechanisms, to handle all single points of failure while still preserving information guarantees for the clients. The system has been implemented in Sun Cluster 3.1, and performance measurements show that it allows clients to detect "unplanned" server failures over 20 times faster than if they relied only on network timeouts.