Friday, March 13, 2015

Don't use the CAP theorem for node failures

"Dead nodes don't bite." - Robert Louis Stevenson (quoted from memory)



CAP is often described as a theorem you cannot avoid. A common saying is “nodes fail, network packets get lost, partitions arise, so you need to use CAP to choose the trade-offs.” Actually, the scope of CAP is not that wide. Let’s look at why a process crash or a node failure is not a partition in CAP.

CAP - the usual reminder

CAP is a theorem, initially conjectured by Eric Brewer [C1] and proved in 2002 by Seth Gilbert and Nancy Lynch [C2]. Its catchy formulation is: “Consistency, Availability, Partition tolerance: pick two.”

Let’s recall the definitions used in the proof [C2].
Consistent is: “Atomic, linearizable, consistency [...]. There must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.”

Available is: “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.”

Partition is: “The network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost.”
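
To picture the consistency definition, here is a minimal Python sketch of an atomic (linearizable) register modelled as a single process that handles one operation at a time, which is exactly the equivalence stated above. The class and names are mine, purely illustrative; they do not come from the paper.

    import threading

    # Toy model of an atomic register: every operation is applied one at a
    # time, so a total order on all operations exists by construction.
    class AtomicRegister:
        def __init__(self, value=0):
            self._value = value
            self._lock = threading.Lock()  # serializes all operations

        def write(self, value):
            with self._lock:               # the write takes effect at a single instant
                self._value = value

        def read(self):
            with self._lock:               # always sees the latest completed write
                return self._value

    register = AtomicRegister()
    register.write(1)
    assert register.read() == 1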

Node failure vs. partition - the wrong intuition

A common and wrong intuition is: “from the point of view of the remaining nodes, it is impossible to distinguish a node that has failed from a node that is on another partition: in both cases, it neither responds nor sends any message. As a consequence, there is no difference between a network partition and a node failure.”

Let’s test this intuition: first against the definition of CAP, then against its proof, and finally against real-life data store implementations.


[Image: CAP theorem, differentiating node failures vs. partitions in the real world. Caption: Is this crocodile dead or is it just sleeping? There is no difference until you try to stroke it.]

Checking vs. the CAP definition

The definition of availability requires that “every request received by a non-failing node in the system must result in a response.” Only non-failing nodes are required to respond: a failed node that stays silent does not break availability. So a node failure is, by definition, outside the scope of the CAP theorem.

That’s simple. But maybe the proof proves more than the definition, and implicitly covers the node failure case? Let’s check.

Checking vs. the CAP proof

The proof is described by Gilbert and Lynch in their paper [C2]. They prove that you cannot be both available and consistent during a partition by exhibiting a scenario that cannot satisfy all three properties at once.

The scenario is simple:
Let’s imagine a system with two nodes: N1 and N2.
step 0: a shared register receives an initial value: SR=0
step 1: N1 is partitioned from N2. By the partition definition, there can be no communication between the two nodes.
step 2: N1 writes a new value for SR: SR=1
step 3: N2 reads SR. It cannot get the latest value written by N1. Hence the system is either not consistent (N2 reads the old value) or not available (the write by N1 fails or the read by N2 fails).

This proof relies on having code running on both sides of the partition. If N2 is dead instead of being partitioned, the scenario cannot exist.
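
To make this concrete, here is a toy Python model of the scenario, assuming a deliberately naive best-effort replication between the nodes (all names besides N1, N2 and SR are mine, for illustration only):

    # Toy model of the Gilbert & Lynch scenario. N1, N2 and SR come from the
    # text above; the replication logic is a deliberately naive illustration.
    class Node:
        def __init__(self, name):
            self.name = name
            self.sr = 0          # local copy of the shared register SR
            self.alive = True
            self.peers = []

        def write(self, value, partitioned=False):
            self.sr = value
            for peer in self.peers:
                if partitioned or not peer.alive:
                    continue     # the update cannot reach the other side
                peer.sr = value

    n1, n2 = Node("N1"), Node("N2")
    n1.peers, n2.peers = [n2], [n1]

    n1.write(1, partitioned=True)  # step 2: N1 writes SR=1 during the partition

    # Step 3: N2 answers the read (it is available) but returns the old value,
    # so the system is not consistent. The only other option is to refuse the
    # read, i.e. to give up availability.
    assert n2.sr == 0

    # If N2 is dead instead of partitioned, it serves no read at all: there is
    # no stale answer to produce, and the proof's scenario cannot be built.
    n2.alive = False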

The CAP proof relies on having ‘real’ partitions. Hence the CAP theorem cannot be applied if the node is simply dead. Does this matter in real life? Let’s check that as well.

Checking vs. the real world

Well, for the real world we just have to copy-paste what Michael Stonebraker said a while ago [C6]: “Now let us turn to single node failures. [...] You simply failover to a replica in a transactionally consistent way. Notably, at least Tandem and Vertica have been doing exactly this for years.”

There is nothing more to add. There are real-world systems that are both available and consistent when there is a node failure.
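
For intuition, here is a minimal sketch of this kind of synchronous primary/replica failover. It is not Tandem's or Vertica's actual design; the classes and names are illustrative assumptions only.

    # Minimal sketch of synchronous replication with failover: a write is
    # acknowledged only once both copies have applied it, so promoting the
    # standby after a primary failure loses no committed write.
    class Replica:
        def __init__(self):
            self.data = {}

        def apply(self, key, value):
            self.data[key] = value

    class ReplicatedStore:
        def __init__(self):
            self.primary = Replica()
            self.standby = Replica()

        def write(self, key, value):
            self.primary.apply(key, value)
            self.standby.apply(key, value)   # synchronous: both copies before the ack

        def read(self, key):
            return self.primary.data[key]

        def fail_over(self):
            # The primary died: the standby takes over. The system keeps
            # answering requests and stays consistent.
            self.primary, self.standby = self.standby, Replica()

    store = ReplicatedStore()
    store.write("x", 42)
    store.fail_over()
    assert store.read("x") == 42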

Back to our intuition

So our intuition was wrong. But why exactly? To answer, let’s go back to this specific sentence: “it is impossible to distinguish a node that has failed from a node that is on another partition.” This is true. The issue is:
  • if you want to be available, you will keep running all the nodes on both sides of the partition. The fact that one side does not know whether the other nodes are dead or partitioned is not an issue: the sides work independently by construction.
  • if you want to be strongly consistent, you need to choose one side of the partition as the available one. The nodes on the other side will kill themselves, or at least will stop answering requests. In this case, a partition is indeed equivalent to a failure, but the equivalence is obtained by implementing specific logic in all the nodes (quorums, masters, …).

This is an implication of the CAP theorem: if we want to cover any kind of operation as CAP does, the only way to stay consistent is to kill, logically or physically, all the nodes that are not on the right side of the partition. In other words, in a strongly consistent system, the intuition is right, or, more precisely, actions are taken to make it right. However, in an AP system the intuition is just wrong.
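
As an illustration of that “kill the wrong side” logic, here is a minimal quorum sketch: a node keeps serving requests only while it can reach a majority of the cluster. This is a generic assumption about how CP systems behave, not the exact rule of any particular product.

    # Minimal majority-quorum rule: a node serves requests only while it can
    # reach a strict majority of the cluster (counting itself).
    CLUSTER_SIZE = 5

    def may_serve(reachable_nodes):
        return reachable_nodes > CLUSTER_SIZE // 2

    # Majority side of a 3/2 partition: keeps serving reads and writes.
    assert may_serve(reachable_nodes=3)

    # Minority side: steps down and refuses requests, which is exactly what
    # makes it look like a set of failed nodes from the outside.
    assert not may_serve(reachable_nodes=2)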

Conclusion

A node failure is not a partition, and the CAP theorem says nothing about node failures. As such, it should not be used when reasoning about node failures or server crashes.




References
[C1] Eric A. Brewer, Towards Robust Distributed Systems, PODC Keynote, July 19, 2000
[C2] Seth Gilbert and Nancy Lynch, Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, ACM SIGACT News, 2002
[C6] Michael Stonebraker, Clarifications on the CAP Theorem and Data-Related Errors, 2010

5 comments:

  1. The intuition you've quoted is right and I'm not quite sure why you think it's wrong.

    In particular:

    >“Now let us turn to single node failures. [...] You simply failover to a replica in
    > a transactionally consistent way. Notably, at least Tandem and Vertica have
    > been doing exactly this for years.”
    >There is nothing more to add. There are real-world systems that are both
    >available and consistent when there is a node failure.

    That's true *in this failure mode*, but the choice of “Availability” in the CAP theorem is much stronger: being available requires a response as long as you have *any* working nodes in the system. The fact that some systems have consistent replica pairs that they can fail over between does not make them both Consistent and Available in the CAP sense — it just makes them Consistent but with more uptime than the naive system.

    Looking at [1], which I assume you're drawing that quote from, it's clear that Stonebraker's main point is not that CAP is wrong, but that it encourages the wrong sort of engineering tradeoffs: yes, you might need to choose between C and A eventually, but your systems should be designed to delay the decision as long as possible — and in many applications, users won't be able to tell the difference. But that doesn't mean partition-tolerance is not equivalent to fault-tolerance.

    In particular, Stonebraker assumes that double failures simply won't happen, so you shouldn't worry about them in your system design. As a developer on a CP system (Ceph) with users who report issues to a mailing list, I am sad to report that for systems at scale he is wrong on this assumption. :(

    > In other words, in a strongly consistent system, the intuition is right, or, more
    > precisely, actions are taken to make it right. However, in an AP system the
    > intuition is just wrong."
    In an AP system you are tolerating a partition, by choosing to *not care* whether the nodes are partitioned or not. But a partition and a node failure are still indistinguishable to any monitoring nodes you have.
    In a CP system you are choosing to be consistent and partition-tolerant, by guaranteeing that one side will win. But again down versus partitioned is indistinguishable to any monitoring nodes you have.
    And that makes a failure and a partition indistinguishable from anybody else’s point of view.

    Indeed, you circle back to this in your penultimate section, so I’m not sure why your lede and conclusion say otherwise.

    Speaking more generally, even if partition tolerance and failure are not equivalent, you still have to design your system (in CAP terms) as if they are, choosing to give up either C or A:

    Let’s say you did try and design a “CA” system. Let it consist of a set {N} nodes. Make some writes {W}. Let the set {M} nodes go down. Make some more writes {X} to the remaining {N}-{M} nodes.

    Now let the {N}-{M} nodes die, and the {M} nodes turn back on. What happens to your system? Do the {M} functioning nodes return data based only on {W}, or do they refuse service? Does this simply count as every node being failed until one of {N}-{M} is back?
    How do those answers differ from applying the same scenario as a partition?


    [1]: http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors

    ReplyDelete
    Replies
    1. Thanks for the feedback.

      > The intuition you've quoted is right and
      > I'm not quite sure why you think it's wrong.
      Because the CAP proof does not hold if the node is dead instead of partitioned :-). A part of this post is just about "applying the right theorem to the right problem". One should not use CAP when reasoning about dead nodes.

      > That's true *in this failure mode*, but the choice
      > of “Availability” in the CAP theorem is much stronger
      I agree. Here I'm just speaking for node failures. If you want to support both node failures and network partitions you have some extra work. This post is not about saying CAP is wrong. It's about saying a node failure is not a partition in CAP.


      > Now let the {N}-{M} nodes die, and the {M} nodes
      > turn back on. What happens to your system?
      There are many possible strategies here. A simple one is to consider a node that turns back on as a plain new node. That's what Bigtable/HBase do.

      > But again down versus partitioned is indistinguishable
      > to any monitoring nodes you have that
      > makes a failure and a partition indistinguishable
      *Some* failures are indistinguishable from partitions. Many failures are actually distinguishable. In many CP systems, if your monitoring system makes the distinction, your 99th percentile MTTR will improve by a huge factor, just because you start the recovery sooner instead of relying on a timeout.

      > even if partition tolerance and failure are not equivalent,
      > you still have to design your system (in CAP terms)
      > as if they are,
      If partition tolerance and failure are not equivalent:
      1) On the theoretical side, it means you can build a CP system that will handle node failures but not network partitions. I don't say you should do that. I say you can, and, if your cluster is 20 nodes max on a LAN, it may be a reasonable choice.
      2) You can optimize your system to manage the partition case and the node failure case differently. Typically, any lock or resource held by a dead node can be recovered immediately. If it's a partition, then deciding what to do is more complex, with more trade-offs.

      It's something I plan to write about in another post, but CAP hides the importance of failure detectors/oracles. From a consistency point of view, you can handle node failures as partitions; but from a latency point of view you need to make the distinction.

      Delete

  2. > The definition of availability requires “every request received by a non-failing node in the system must result in a response.” So, a node failure is, by definition, outside the scope of the CAP theorem.

    Well, it's potentially outside the scope of the definition of availability. But it's not clear here that non-failing refers specifically to hardware failure, VM failure, process failure, network failure, etc. It reads to me like the author is carving out some leeway for sensibility in saying a node must respond. That *of course* a failing (in some sense) node can't respond.

    > This proof relies on having code running on both sides of the partition. If N2 is dead instead of being partitioned, the scenario cannot exist.

    That's not how I read it. Step 3 merely requires that N2 fail to read. If N2 has died, lost its network connection, or paused for 1 year, then it will fail to read and satisfy step 3.

    ReplyDelete
    Replies
    1. > The definition of availability requires “every request received by a non-failing node in the system must result in a response.”

      More specifically, this establishes that nodes do fail in the system, yet availability can be maintained. It says you must deal with node failure, and the primary failure mode is failure-to-respond. The specific failure behind failure-to-respond is more of an engineering concern than a theoretical one.

      Delete
    2. I'm not sure I got your point.
      > If N2 has died [...] then it will fail to read and satisfy step 3.
      If N2 is dead, there is no risk of inconsistency: N2 cannot do anything. With a partition, you need to choose.

      > More specifically, this establishes that nodes do fail in the system, yet availability can be maintained.
      The way the proof and the theorem are written is basically "we're not interested in failed nodes: failed nodes and nodes that cannot receive requests are excluded from the scope of our analysis." But yes, in real life we have to manage them (and CAP does not help us here).

      Delete