Abstract
Unreliable failure detectors introduced by Chandra and Toueg are abstract mechanisms that provide information on process failures. On the one hand, failure detectors allow to state the minimal requirements on process failures that allow to solve problems that cannot be solved in purely asynchronous systems. But, on the other hand, they cannot be implemented in such systems: their implementation requires that the underlying distributed system be enriched with additional assumptions. The usual failure detector implementations rely on additional synchrony assumptions (e.g., partial synchrony). This paper proposes a new look at the implementation of failure detectors and more specifically at Chandra-Toueg?s failure detectors. The proposed approach does not rely on synchrony assumptions (e.g., it allows the communication delays to always increase). It is based on a query-response mechanism and assumes that the query/response messages exchanged obey a pattern where the responses from some processes to a query arrive among the (n - f) first ones (n being the total number of processes, f the maximum number of them that can crash, with 1 ≤ f < n). When we consider the particular case f = 1, and the implementation of a failure detector of the class denoted \diamondS (the weakest class that allows to solve the consensus problem), the additional assumption the underlying system has to satisfy boils down to a simple channel property, namely, there is eventually a pair of processes (pi, pj) such that the channel connecting them is never the slowest among the channels connecting pi or pj to the other processes. A probabilistic analysis shows that this requirement is practically met in asynchronous distributed systems.