Gray Failure

Posted on January 29, 2020

Gray Failure: The Achilles’ Heel of Cloud-Scale Systems is a delightful paper because it’s easy to read, the findings are directly useful in my day job, and because it reframes an important issue in a way which is not intuitive at first but simplifies it and leads to solutions.

All software fails from time to time, but evidence indicates that cloud scale services fail differently than other software which might be less distributed. The authors characterize this failure mode as “gray failure,” in which “component failures whose manifestations are fairly subtle, and thus defy quick and definitive detection.” This is in contrast to hard failure when a service simply stops working altogether. Examples include “severe performance degradation, random packet loss, flaky I/O, memory thrashing, capacity pressure, and non-fatal exceptions.”

So why is this distinction useful? Failure is failure, right? The authors maintain that, in contrast to hard failures, gray failure tends to increase very rapidly with scale. For example, consider network switches: “increasing redundancy can counter-intuitively hurtavailability because the more core switches there are, the more likelyat least one of them will experience a gray failure.”

This tells us several things. First, a gray failure might be a leading indicator of (in other words, it precedes) a hard failure. Second, the usual response to hard failure, restarting the service or host, may not help at all (as with the case of running out of disk space) and will probably make things much worse by introducing latency into the system. Third, horizontal scaling may make the problem considerably worse, as an increasing number of service hosts increases correspondingly the chances of a gray failure in any of them.

How to detect gray failures is one of the most important contributions of the paper. Gray failures are characterized by “differential observability.” This means that the consumer of a service, say, an application, will observe a different result than another entity which is looking at the service, such as a health check. So the health check might just be checking to see if the service is up, but might not notice that it’s running very slowly due to having consumed most of the hard drive space. The paper notes, “many gray failure cases we investigated are only detectable in a distributed fashion because each individual component has only a partial view of the entire system.” The authors advise detecting gray failure by “closing the observation gaps between the system and the apps it services.” In particular, they recommend “moving from singular failure detection (e.g., with heartbeats) to multidimensional health monitoring.”

If any of the above is interesting to you, I’d recommend reading the whole paper. It’s short (only five pages), and it’s packed with useful information.

Tags: failure, cloud, papers