Friday, March 13, 2015

Don't use the CAP theorem for node failures

"Dead nodes don't bite." - Robert Louis Stevenson (quoted from memory)



CAP is often described as a theorem you cannot avoid. A common saying is “nodes fail, network packets get lost, partitions arise, so you need to use CAP to choose the trade-offs.” Actually, the scope of CAP is not that wide. Let’s look at why a process crash or a node failure is not a partition in CAP.

CAP - the usual reminder

CAP is a theorem, initially conjectured by Eric Brewer [C1] and proved in 2002 by Seth Gilbert and Nancy Lynch [C2]. Its catchy formulation is: “Consistency, Availability, Partition tolerance: pick two.”

Let’s recall the definitions used in the proof [C2].
Consistent is: “Atomic, linearizable, consistency [...]. There must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.”

Available is: “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.”

Partition is: “The network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost.”
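
To picture the consistency definition, here is a minimal Python sketch of an atomic (linearizable) register modelled as a single process that handles one operation at a time, which is exactly the equivalence stated above. The class and names are mine, purely illustrative; they do not come from the paper.

    import threading

    # Toy model of an atomic register: every operation is applied one at a
    # time, so a total order on all operations exists by construction.
    class AtomicRegister:
        def __init__(self, value=0):
            self._value = value
            self._lock = threading.Lock()  # serializes all operations

        def write(self, value):
            with self._lock:               # the write takes effect at a single instant
                self._value = value

        def read(self):
            with self._lock:               # always sees the latest completed write
                return self._value

    register = AtomicRegister()
    register.write(1)
    assert register.read() == 1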

Node failure vs. partition - the wrong intuition

A common and wrong intuition is: “from the point of view of the remaining nodes, it is impossible to distinguish a node that has failed from a node that is on another partition: in both cases, it neither responds nor sends any message. As a consequence, there is no difference between a network partition and a node failure.”

Let’s test this intuition: first against the definition of CAP, then against its proof, and finally against real-life data store implementations.


[Image: CAP theorem, differentiating node failures vs. partitions in the real world. Caption: Is this crocodile dead or is it just sleeping? There is no difference until you try to stroke it.]

Checking vs. the CAP definition

The definition of availability requires that “every request received by a non-failing node in the system must result in a response.” Only non-failing nodes are required to respond: a failed node that stays silent does not break availability. So a node failure is, by definition, outside the scope of the CAP theorem.

That’s simple. But maybe the proof proves more than the definition, and implicitly covers the node failure case? Let’s check.

Checking vs. the CAP proof

The proof is described by Gilbert and Lynch in their paper [C2]. They prove that you cannot be both available and consistent during a partition by exhibiting a scenario that cannot satisfy all three properties at once.

The scenario is simple:
Let’s imagine a system with two nodes: N1 and N2.
step 0: a shared register receives an initial value: SR=0
step 1: N1 is partitioned from N2. By the partition definition, there can be no communication between the two nodes.
step 2: N1 writes a new value for SR: SR=1
step 3: N2 reads SR. It cannot get the latest value written by N1. Hence the system is either not consistent (N2 reads the old value) or not available (the write by N1 fails or the read by N2 fails).

This proof relies on having code running on both sides of the partition. If N2 is dead instead of being partitioned, the scenario cannot exist.
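
To make this concrete, here is a toy Python model of the scenario, assuming a deliberately naive best-effort replication between the nodes (all names besides N1, N2 and SR are mine, for illustration only):

    # Toy model of the Gilbert & Lynch scenario. N1, N2 and SR come from the
    # text above; the replication logic is a deliberately naive illustration.
    class Node:
        def __init__(self, name):
            self.name = name
            self.sr = 0          # local copy of the shared register SR
            self.alive = True
            self.peers = []

        def write(self, value, partitioned=False):
            self.sr = value
            for peer in self.peers:
                if partitioned or not peer.alive:
                    continue     # the update cannot reach the other side
                peer.sr = value

    n1, n2 = Node("N1"), Node("N2")
    n1.peers, n2.peers = [n2], [n1]

    n1.write(1, partitioned=True)  # step 2: N1 writes SR=1 during the partition

    # Step 3: N2 answers the read (it is available) but returns the old value,
    # so the system is not consistent. The only other option is to refuse the
    # read, i.e. to give up availability.
    assert n2.sr == 0

    # If N2 is dead instead of partitioned, it serves no read at all: there is
    # no stale answer to produce, and the proof's scenario cannot be built.
    n2.alive = False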

The CAP proof relies on having ‘real’ partitions. Hence the CAP theorem cannot be applied if the node is simply dead. Does this matter in real life? Let’s check that as well.

Checking vs. the real world

Well, for the real world we just have to copy-paste what Michael Stonebraker said a while ago [C6]: “Now let us turn to single node failures. [...] You simply failover to a replica in a transactionally consistent way. Notably, at least Tandem and Vertica have been doing exactly this for years.”

There is nothing more to add. There are real-world systems that are both available and consistent when there is a node failure.
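
For intuition, here is a minimal sketch of this kind of synchronous primary/replica failover. It is not Tandem's or Vertica's actual design; the classes and names are illustrative assumptions only.

    # Minimal sketch of synchronous replication with failover: a write is
    # acknowledged only once both copies have applied it, so promoting the
    # standby after a primary failure loses no committed write.
    class Replica:
        def __init__(self):
            self.data = {}

        def apply(self, key, value):
            self.data[key] = value

    class ReplicatedStore:
        def __init__(self):
            self.primary = Replica()
            self.standby = Replica()

        def write(self, key, value):
            self.primary.apply(key, value)
            self.standby.apply(key, value)   # synchronous: both copies before the ack

        def read(self, key):
            return self.primary.data[key]

        def fail_over(self):
            # The primary died: the standby takes over. The system keeps
            # answering requests and stays consistent.
            self.primary, self.standby = self.standby, Replica()

    store = ReplicatedStore()
    store.write("x", 42)
    store.fail_over()
    assert store.read("x") == 42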

Back to our intuition

So our intuition was wrong. But why exactly? To answer, let’s go back to this specific sentence: “it is impossible to distinguish a node that has failed from a node that is on another partition.” This is true. The issue is:
  • if you want to be available, you will keep running all the nodes on both sides of the partition. The fact that one side does not know whether the other nodes are dead or partitioned is not an issue: the sides work independently by construction.
  • if you want to be strongly consistent, you need to choose one side of the partition as the available one. The nodes on the other side will kill themselves, or at least will stop answering requests. In this case, a partition is indeed equivalent to a failure, but the equivalence is obtained by implementing specific logic in all the nodes (quorums, masters, …).

This is an implication of the CAP theorem: if we want to cover any kind of operation as CAP does, the only way to stay consistent is to kill, logically or physically, all the nodes that are not on the right side of the partition. In other words, in a strongly consistent system, the intuition is right, or, more precisely, actions are taken to make it right. However, in an AP system the intuition is just wrong.
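
As an illustration of that “kill the wrong side” logic, here is a minimal quorum sketch: a node keeps serving requests only while it can reach a majority of the cluster. This is a generic assumption about how CP systems behave, not the exact rule of any particular product.

    # Minimal majority-quorum rule: a node serves requests only while it can
    # reach a strict majority of the cluster (counting itself).
    CLUSTER_SIZE = 5

    def may_serve(reachable_nodes):
        return reachable_nodes > CLUSTER_SIZE // 2

    # Majority side of a 3/2 partition: keeps serving reads and writes.
    assert may_serve(reachable_nodes=3)

    # Minority side: steps down and refuses requests, which is exactly what
    # makes it look like a set of failed nodes from the outside.
    assert not may_serve(reachable_nodes=2)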

Conclusion

A node failure is not a partition, and the CAP theorem says nothing about node failures. As such, it should not be used when reasoning about node failures or server crashes.




References
[C1] Eric A. Brewer, Towards Robust Distributed Systems, PODC Keynote, July 19, 2000
[C2] Seth Gilbert and Nancy Lynch, Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, ACM SIGACT News, 2002
[C6] Michael Stonebraker, Clarifications on the CAP Theorem and Data-Related Errors, 2010

5 comments:

  1. The intuition you've quoted is right and I'm not quite sure why you think it's wrong.

    In particular:

    >“Now let us turn to single node failures. [...] You simply failover to a replica in
    > a transactionally consistent way. Notably, at least Tandem and Vertica have
    > been doing exactly this for years.”
    >There is nothing more to add. There are real-world systems that are both
    >available and consistent when there is a node failure.

    That's true *in this failure mode*, but the choice of “Availability” in the CAP theorem is much stronger: being available requires a response as long as you have *any* working nodes in the system. The fact that some systems have consistent replica pairs that they can fail over between does not make them both Consistent and Available in the CAP sense — it just makes them Consistent but with more uptime than the naive system.

    Looking at [1], which I assume you're drawing that quote from, it's clear that Stonebraker's main point is not that CAP is wrong, but that it encourages the wrong sort of engineering tradeoffs: yes, you might need to choose between C and A eventually, but your systems should be designed to delay the decision as long as possible — and in many applications, users won't be able to tell the difference. But that doesn't mean partition-tolerance is not equivalent to fault-tolerance.

    In particular, Stonebraker assumes that double failures simply won't happen, so you shouldn't worry about them in your system design. As a developer on a CP system (Ceph) with users who report issues to a mailing list, I am sad to report that for systems at scale he is wrong on this assumption. :(

    > In other words, in a strongly consistent system, the intuition is right, or, more
    > precisely, actions are taken to make it right. However, in an AP system the
    > intuition is just wrong."
    In an AP system you are tolerating a partition, by choosing to *not care* whether the nodes are partitioned or not. But a partition and a node failure are still indistinguishable to any monitoring nodes you have.
    In a CP system you are choosing to be consistent and partition-tolerant, by guaranteeing that one side will win. But again down versus partitioned is indistinguishable to any monitoring nodes you have.
    And that makes a failure and a partition indistinguishable from anybody else’s point of view.

    Indeed, you circle back to this in your penultimate section, so I’m not sure why your lede and conclusion say otherwise.

    Speaking more generally, even if partition tolerance and failure are not equivalent, you still have to design your system (in CAP terms) as if they are, choosing to give up either C or A:

    Let’s say you did try and design a “CA” system. Let it consist of a set {N} nodes. Make some writes {W}. Let the set {M} nodes go down. Make some more writes {X} to the remaining {N}-{M} nodes.

    Now let the {N}-{M} nodes die, and the {M} nodes turn back on. What happens to your system? Do the {M} functioning nodes return data based only on {W}, or do they refuse service? Does this simply count as every node being failed until one of {N}-{M} is back?
    How do those answers differ from applying the same scenario as a partition?


    [1]: http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors

    ReplyDelete
    Replies
    1. Thanks for the feedback.

      > The intuition you've quoted is right and
      > I'm not quite sure why you think it's wrong.
      Because the CAP proof does not hold if the node is dead instead of partitioned :-). A part of this post is just about "applying the right theorem to the right problem". One should not use CAP when reasoning about dead nodes.

      > That's true *in this failure mode*, but the choice
      > of “Availability” in the CAP theorem is much stronger
      I agree. Here I'm just speaking for node failures. If you want to support both node failures and network partitions you have some extra work. This post is not about saying CAP is wrong. It's about saying a node failure is not a partition in CAP.


      > Now let the {N}-{M} nodes die, and the {M} nodes
      > turn back on. What happens to your system?
      There are many possible strategies here. A simple one is to consider a node that turns back on as a plain new node. That's what Bigtable/HBase do.

      > But again down versus partitioned is indistinguishable
      > to any monitoring nodes you have that
      > makes a failure and a partition indistinguishable
      *Some* failures are indistinguishable from partitions. Many failures are actually distinguishable. In many CP systems, if your monitoring system makes the distinction, your 99th percentile MTTR will improve by a huge factor, just because you start the recovery sooner instead of relying on a timeout.

      > even if partition tolerance and failure are not equivalent,
      > you still have to design your system (in CAP terms)
      > as if they are,
      If partition tolerance and failure are not equivalent:
      1) On the theoretical side, it means you can build a CP system that will handle node failures but not network partitions. I don't say you should do that. I say you can, and, if your cluster is 20 nodes max on a LAN, it may be a reasonable choice.
      2) You can optimize your system to manage the partition case and the node failure case differently. Typically, any lock or resource held by a dead node can be recovered immediately. If it's a partition, then deciding what to do is more complex, with more trade-offs.

      It's something I plan to write about in another post, but CAP hides the importance of failure detectors/oracles. From a consistency point of view, you can handle node failures as partitions; but from a latency point of view you need to make the distinction.

      Delete

  2. > The definition of availability requires “every request received by a non-failing node in the system must result in a response.” So, a node failure is, by definition, outside the scope of the CAP theorem.

    Well, it's potentially outside the scope of the definition of availability. But it's not clear here that non-failing refers specifically to hardware failure, VM failure, process failure, network failure, etc. It reads to me like the author is carving out some leeway for sensibility in saying a node must respond. That *of course* a failing (in some sense) node can't respond.

    > This proof relies on having code running on both sides of the partition. If N2 is dead instead of being partitioned, the scenario cannot exist.

    That's not how I read it. Step 3 merely requires that N2 fail to read. If N2 has died, lost its network connection, or paused for 1 year, then it will fail to read and satisfy step 3.

    ReplyDelete
    Replies
    1. > The definition of availability requires “every request received by a non-failing node in the system must result in a response.”

      More specifically, this establishes that nodes do fail in the system, yet availability can be maintained. It says you must deal with node failure, and the primary failure mode is failure-to-respond. The specific failure behind failure-to-respond is more of an engineering concern than a theoretical one.

      Delete
    2. I'm not sure I got your point.
      > If N2 has died [...] then it will fail to read and satisfy step 3.
      If N2 is dead, there is no risk of inconsistency: N2 cannot do anything. With a partition, you need to choose.

      > More specifically, this establishes that nodes do fail in the system, yet availability can be maintained.
      The way the proof and the theorem are written is basically "we're not interested in failed nodes: failed nodes and nodes that cannot receive requests are excluded from the scope of our analysis." But yes, in real life we have to manage them (and CAP does not help us here).

      Delete