Monday, July 6, 2015

You Do It Too: Forfeiting Network Partition Tolerance in Distributed Systems

“64-node clusters ought to be enough for anybody.” Bill Gates, quoted from memory.

This post is part of the CAP theorem series. You may want to start with my post on ACID vs. CAP if you have a database background but have never really been exposed to the CAP theorem. The post discussing some traps in the ‘Availability’ and ‘Consistency’ definitions of CAP can also serve as an introduction if you know CAP but haven’t looked at its formal definition.


The CA--consistent, available, but not network partition tolerant--category in CAP has a very specific history. Not only can forfeiting “network partition tolerance” be understood as impossible in theory and crazy in practice (P as an illusion of a choice), but there is also an overlap between the CA and CP categories. As a result, many consider it impossible to build a production CA system. But you can actually build a system without network partition tolerance, and sometimes you should.

A brief history of the CA category

Let’s look at the academic history of the CA category in CAP:
  • In 2000, Eric Brewer presents the CAP conjecture. In his presentation, CA exists, for example for systems using the two-phase commit protocol. He considers that “the whole space is useful.”
  • In 2002, Seth Gilbert and Nancy Lynch publish the CAP proof. CA exists: "Systems that run on intranets and LANs are an example of these types of algorithms."
  • In 2010, Daniel Abadi raises the point that there is an overlap between CA and CP: "What does 'not tolerant' mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical."
  • Still in 2010, Michael Stonebraker publishes multiple documents around the limited importance of partitions, with the tagline “Myth #6: In CAP, choose AP over CA”, considering that with the capacity of modern hardware, small distributed systems can solve most real-life issues, and that "it doesn’t much matter what you do when confronted with network partitions."
  • And again in 2010, Coda Hale publishes a blog post: You cannot sacrifice partition tolerance, explaining that only AP and CP are possible.
  • This triggers a response from Stonebraker, who restates all his points.
  • 2 years later, in 2012, referring to these works, Eric Brewer states that “exactly what it means to forfeit P is unclear” and then clarifies: “choosing CA should mean that the probability of a partition is far less than that of other systemic failures, such as disasters or multiple simultaneous faults.”


So we need to sort out the following issues:
  • There is an overlap between CP and CA.
  • There is a theoretical impossibility: network partitions are a given; you can choose between ‘A’ and ‘C’ when a partition happens, but you cannot choose whether partitions happen.
  • There is a practical impossibility: network partitions are too likely to happen on a real life system to be ignored, so CA is impossible in practice.
What does CA mean?
CA is about “forfeiting network partition tolerance”, i.e. being “network partition intolerant”. Partition intolerance does not mean that network partitions cannot happen; it means that network partitions are a critical issue. It’s a bit like gluten: being “gluten intolerant” does not mean that you cannot eat any, it means that you should not. As with gluten, a CA system should also have a means of recovery should a network partition actually happen. The two-phase commit is a perfect example: it comes with a repair tool to fix the transactions broken by a heuristic resolution, as the sketch below illustrates.
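To make the two-phase commit example concrete, here is a minimal coordinator sketch (the classes and method names are hypothetical, not a real library). It shows why a partition is a critical issue for such a system: a participant that voted yes but becomes unreachable during the second phase is left in doubt, holding its locks, and resolving it heuristically may break atomicity--which is exactly what the repair tool has to fix afterwards.

```python
class PartitionError(Exception):
    """Raised when a participant becomes unreachable (hypothetical)."""

def two_phase_commit(participants, txn):
    # Phase 1: ask every participant to vote on the transaction.
    for p in participants:
        if not p.prepare(txn):            # a single 'no' vote aborts
            for q in participants:
                q.abort(txn)
            return "aborted"

    # Phase 2: all voted yes, so the global decision is 'commit'.
    in_doubt = []
    for p in participants:
        try:
            p.commit(txn)
        except PartitionError:
            # Prepared but unreachable: this participant holds locks and
            # cannot decide alone. A heuristic local commit or abort may
            # break atomicity; the repair tool exists to fix precisely
            # these transactions once the partition heals.
            in_doubt.append(p)

    return "committed" if not in_doubt else "in-doubt"
```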
The fact that CA does not mean "I have a network that cannot be partitioned" is important, because it implies a partition can actually happen. This is stressed by Brewer: "choosing CA should mean that the probability of a partition is far less than that of other systemic failures." To estimate this probability you must be quite clear about what a partition actually is. This whole post is only about network partitions.
Let's summarize: CA describes the specification of an operating range, not a behavior. CP and AP describe the behavior when a partition occurs. This obviously leaves room for an overlap between CP and CA. Let’s look at this overlap now.
The overlap between CP and CA
This is the point identified by Abadi: "What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical." A system that does not do anything once partitioned is trivially CP: it never presents a non-consistent history. Such a system could also be considered CA: it stops working when there is a partition--hence the overlap. This overlap is minimal, however:
  • Many CA systems are not CP: for example, the two-phase commit protocol is not consistent (nor available, nor ACID-atomic) when there is a partition.
  • Many CP systems are not CA: for example, a consensus server like ZooKeeper is totally tolerant to partitions.
The only systems in both categories are those that stop working during the partition but are consistent once the partition is fixed (a web server connected to a database is a trivial example). I personally prefer to call these systems ‘CA’ rather than ‘CP’, even if the CAP theorem allows for both: it expresses that a partition is a severe issue for the system. Ultimately, it’s your choice.
Partitions are a given in the CAP theorem
That’s exactly CAP: if there is a partition, you have to choose between ‘A’ and ‘C’. We have a model that allows partitions, and a theorem that says we must choose ‘A’ or ‘C’ when a partition happens, so we cannot “refuse to see partitions”. But “forfeiting partitions” is exactly that: removing partitions from the model and building our application on a brand new model without partitions.


From a theoretical point of view, forfeiting partitions means removing them from the model. They will never happen in our theoretical model.

From a practical point of view, forfeiting partitions means removing them from the operating range. They may happen in reality.


By definition a model differs from reality. The question is always: is this model a good representation of reality?

Partitions happen too often in real life to be ignored

This is where the debate between Coda Hale and Michael Stonebraker ended: Hale saying that there are a lot of partitions in his datacenters; Stonebraker saying that there are failures more probable than partitions that are not fixed anyway, and that surviving partitions will not “move the needle” on availability.
Without agreed-upon data, there is no real way out of this debate. The good news is that we don’t have to revive it to say that CA can be used to describe a distributed system: a CA system is a system built by someone who thinks he can forfeit partitions.
But the key point of the discussion is the difficulty of reasoning about failures without describing the system. In the debate above, Hale was speaking about systems of “any interesting scale”, while Stonebraker was considering small systems of high-end servers on a LAN (“if you need 200 nodes to support a specific SQL application, then VoltDB can probably do the same application on 4 nodes”). But these two types of distributed systems are totally different animals. When discussing a design, remember the old programming rule--“fancy algorithms are slow when n is small, and n is usually small”--and check the value of n.

When to use CA

The branch can be partitioned from the tree, but it may not be the monkey’s main concern.


Let’s recall what Brewer wrote in 2012: “choosing CA should mean that the probability of a partition is far less than that of other systemic failures, such as disasters or multiple simultaneous faults.”


Eric Brewer gave more details in a mail he sent me (quoted here with his permission):
I tend to explain it a few different ways:
1) it is trivial to get CA in a non-distributed system, such as a single node
2) it is also fine to assume CA on a LAN, especially if it is (over) engineered for multiple paths or even for fail stop.  The CM-5 had an over-engineered network that would halt if it detected any errors, but it almost never did (in fact I don't know of case where it actually stopped, but there probably were some).  The CM-5 case thus really was an operating range argument.
3) If the probability of a partition is lower than other major system failures that would take out an application, then you can claim CA.  For example, you might lose a quorum due to correlated failures (such as power or a disaster), which would also lose availability even though not a partition.  If your network is 5 9s, you can probably ignore the partition case in terms of the code you write (but you should at least detect it!).
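For scale, a quick arithmetic sketch of what a “5 9s” network allows: roughly five minutes of cumulative partition per year.

```python
# Plain arithmetic: cumulative downtime allowed by 99.999% availability.
seconds_per_year = 365 * 24 * 3600
allowed_seconds = (1 - 0.99999) * seconds_per_year
print(allowed_seconds / 60)  # ~5.3 minutes per year
```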


“Choosing CA should mean that the probability of a partition is far less than that of other systemic failures” says we can call a system CA if the probability of a partition is minimal--the non-distributed or over-engineered network case. These systems are often not of “any interesting scale”, but that doesn’t mean they don’t have any business value.


There is a more complex case: the probability of “multiple simultaneous faults” depends on many things, including the software itself. A lot of non-critical software is more likely to get data corruption from a software bug than from a network partition, just because simple error scenarios like bad user input are not tested enough. A complicated administration interface is also a common source of downtime. In other words, choosing CA depends both on the network quality and on the quality of the software itself.


Network partition tolerance is a feature like any other. It has to be planned, implemented and tested. And, as with any feature, the decision to implement it or not must weigh the feature’s benefits against its implementation cost. For this feature, the comparison is:


expected number of partitions * cost per partition (unavailability, reputation, repair …)
vs.
cost of supporting partitions (testing effort included).
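As a back-of-the-envelope illustration (every number below is made up for the example):

```python
# All figures are hypothetical, for illustration only.
expected_partitions_per_year = 0.1    # e.g. a small, over-engineered LAN
cost_per_partition = 50_000           # unavailability, reputation, repair...
cost_of_tolerance = 80_000            # design, implementation, testing

expected_loss = expected_partitions_per_year * cost_per_partition  # 5,000
# Partition tolerance pays for itself only if the loss it avoids
# exceeds what it costs to build and test.
worth_implementing = expected_loss > cost_of_tolerance
print(worth_implementing)  # False with these numbers
```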


Even if this trade-off favors partition tolerance, i.e. the system should be partition tolerant, other features could have a better cost/benefit ratio and will be prioritized. That’s a well-known engineering drama: a feature being useful and bringing value does not guarantee that it gets implemented in the end.


An example of such a CA system is a GPU-based machine learning cluster. The one built by Baidu was “comprised of 36 server nodes, each with 2 six-core Intel Xeon E5-2620 processors. Each server contains 4 Nvidia Tesla K40m GPUs and one FDR InfiniBand (56Gb/s) which is a high-performance low-latency interconnection and supports RDMA. The peak single precision floating point performance of each GPU is 4.29TFlops and each GPU has 12GB of memory.” For such a system, partition tolerance is not an absolute necessity: if a partition occurs, the calculation can be restarted from scratch once the partition is fixed. As already stated, this does not mean partition tolerance is not useful: it would typically be useful should the calculation take weeks. But such systems can also exist without partition tolerance, as sketched below.
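A minimal sketch of this “restart from scratch” strategy (the parameters are hypothetical; the point is that no recovery code is needed at all):

```python
import time

def run_until_done(compute, network_is_healthy, retry_delay_s=60):
    """Run a non-partition-tolerant computation to completion."""
    while True:
        try:
            return compute()               # the full multi-hour job
        except ConnectionError:            # a partition hit mid-run
            while not network_is_healthy():
                time.sleep(retry_delay_s)  # wait out the partition
            # ...then simply restart from scratch: no checkpoints,
            # no reconciliation, no partition-tolerance code to test.
```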

Conclusion: “the whole space is useful”

Being partition tolerant is comfortable. You have to be Stonebraker to claim partition intolerance. On the other hand, Kyle ‘Aphyr’ Kingsbury regularly proves with smart but simple tests that many systems used in production are not network partition tolerant.

This is not to say that network partition tolerance can be easily forfeited, especially if the system is of “any interesting scale.” But it is worth first checking the system’s size: is it really of “any interesting scale”? Exactly as a system that does not need to be distributed should not be distributed, a distributed system that can be kept small should be kept small.

There is also a catch in how CAP is sometimes (mis)understood: “node failures, process crashes and network partitions are all partitions, so you have to be partition tolerant”. This is not only false but also dangerous: it hides the fact that each of these faults can be tackled independently, with its own priority. Before trying to be available during a network partition, you should first validate that you don’t lose data on a single process crash. With fault tolerance, as with any other problem, decomposing it makes it easier to solve. Network partition is just one type of fault out of many.
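For instance, the kind of check that comes long before partition tolerance is making a single write survive a process crash. A minimal sketch (the log path and record format are hypothetical):

```python
import os

def durable_append(log_path, record: bytes):
    """Acknowledge a write only once it can survive a process crash."""
    with open(log_path, "ab") as log:
        log.write(record + b"\n")
        log.flush()                # Python buffers -> OS
        os.fsync(log.fileno())     # OS buffers -> disk
    # Only now is it safe to acknowledge the write: a crash right
    # after this point no longer loses the record.
```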

So, sometimes using CA just makes sense. As already stated by Eric Brewer: “the whole space is useful.”


Many thanks to Eric Brewer for his feedback. Errors are mine.
Updated 7/10/15 to clarify a few things and to make clear that this post is only about network partitions.


This post is part of the CAP theorem series
(coming soon!)

2 comments:

  1. Curious as to what you think of this response: https://aphyr.com/posts/325-comments-on-you-do-it-too

  2. I updated the post to make clear that the partitions mentioned here are network partitions. That was clearly confusing. Also, there was no discussion around CP vs. AP; the discussion is around CA/CP. I changed a sentence that was unclear, and I added a few details here and there.

    CP and CA overlap because CA does not mean that partitions are impossible. Brewer: "It is best to think about this probabilistically: choosing CA should mean that the probability of a partition is far less than that of other systemic failures."

    For the data itself, we need the probability of a partition. The paper pointed to in the response does a pretty good job of showing that networks are not reliable, but not of estimating probabilities.

    And the same paper confirms you can't generalize anyway: "On the other hand, some networks really are reliable. Engineers at major financial firms have anecdotally reported that despite putting serious effort into designing systems that gracefully tolerate partitions, their networks rarely, if ever, exhibit partition behavior."

    I stick to my conclusion: “the whole space is useful.”
