Thursday, April 16, 2015

CAP, Availability, High-Availability and Big Data databases in a world of partitions

“Those who would give up Consistency, to purchase a little temporary Availability, deserve neither Consistency nor Availability.” (B. Franklin, quoted from memory)

CAP theorem series
This post is part of the CAP theorem series. You may want to start by my post on ACID vs. CAP if you have a database background but have never really been exposed to the CAP theorem. The post discussing some traps in the ‘Availability’ and ‘Consistency’ definition of CAP should also be used as an introduction if you know CAP but haven’t looked at its formal definition.

This post, as was the previous one, is about using CAP to categorize distributed systems. It summarizes its results and extends them to “big data” distributed systems. Compared to other distributed systems, “big data” systems add another dimension: by definition, the data does not fit into a single node. As we will see, this has some implications.

CAP theorem and big data by A.E. van Vogt
Science Fiction authors are often ahead
of their time. Will this be the case here?

Looking at the terminology - CAP theorem & Availability

Here we will use the definitions from the proof of the CAP theorem  by Gilbert and Lynch [C2].

Consistent is [C2]: “Atomic, linearizable, consistency [...]. There must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.” We have seen that CAP consistency is different from ACID consistency.

Partition is [C2]: “The network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost.” We have seen that node failures and packet losses are not partitions.

Available is [C2]: “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.” From now on I will use CAP-Available to distinguish this definition from the other definitions of availability.

CAP defines availability clearly. The most common definition of availability we can find elsewhere is: availability is “time system available” divided by “total time”. There are two issues with this definition in the CAP context:
  • It defines availability without defining what ‘being available’ means.
  • CAP looks ‘into’ the system, as the system can be partitioned. It’s something not considered in the common definition, where the system is considered only as a whole.
However, it is possible to define high-availability and partial availability using a terminology comparable to the one used by CAP. That’s what we do next.

High-Availability: “Every request received by a subset of the non-failing nodes in the system must result in a response.
That’s usually how High-Availability is understood: whatever the state of the nodes--failing or non-failing, there is at least a subset of nodes that will answer all requests.

Partial-Availability: “Some requests received by a subset of the non-failing nodes in the system must result in a response.

Non-Availability: All requests fail.

Looking at CAP categories

Let’s recapitulate CAP categories:
  • CP: Consistent, but not CAP-Available.
  • AP: CAP-Available, but not Consistent.
  • CA: Consistent and CAP-Available, but not partition-tolerant. This specifies an operating range, as we saw here, so the overlap between CA and CP is not a real issue.
  • CA: Not Consistent and Not CAP-Available. For systems that are neither consistent nor CAP-Available (The notation with the overline comes from the ensemble theory).

CAP and Availability applied to distributed systems

Now let’s match this on CAP, and on some real systems when there is a network partition, in the table below. A ‘▲’ means that it’s the only or most probable case. A ‘△’ means it can happen but it’s a less probable case.

System description / CAP category
One web-server connected to one SQL database

One web-server connected to one Eventually Consistent database

Two-phase commit between multiple SQL databases

Consensus (Chubby, Zookeeper)

Typical CP big data system

Typical Eventually Consistent big data system
Ideal Eventually Consistent small data system

It should be read like this:
  • The first row means: “One web-server connected to one SQL database” is a CA system, not available during partitions.
  • Typical CP big data system” is a CP system, usually (the ‘’) highly-available during partition, but some partitions may lead to partial- or non-availability (the ‘’).

Now let’s have a look at the rows one by one.

CA/CP systems looked at in the previous post

Some systems were looked at in this previous post. Let’s go over the conclusions again.

Single web-server with a single database

This system can claim to belong to two categories:
  • CP (behavior when partitioned),
  • and CA (specifies that it does not handle partition).

So we can choose:
  • Saying CP is good for marketing: while the system does not work when there is a partition, it is tagged “partition-tolerant”. It’s great until there is a partition and the end-user discovers what you really meant by CP.
  • Saying CA highlights that network partitions are a real issue for the system, and that special attention should be given to the network. This seems to be the best category from a software engineering point of view.

Single web-server with an eventually consistent database

Nothing much to add to what was said in the previous post: there is no other option than CA as this system is never consistent, nor available during partition.

Multiple databases and the two-phase commit

As we saw in this previous post, this system can claim to belong to two categories:
  • CA (behavior when partitioned)
  • and CA (specifies that it does not handle partition).

CA is the best choice here, because:
  • The two-phase commit not only gives up consistency, but also other key properties (ACID atomicity).
  • The two-phase commit gives-up ACID atomicity only for the transactions that were in progress when the network partitions started, but will choose consistency-over-availability for all others cases.

Distributed Consensus

Nothing to add to what was said in the previous post: this is a typical example of a CP system.

A CP Big Data store

This one has multiple types of availability. For a single reason: a CP big data store can be highly-available for some partitions, but not for every partition. Some partitions will lead, at the very least, to a partial availability.

This is more or less intuitive: the data is big, so it can’t fit in all possible partitions. Then, depending on the network partition, we can find:
  1. One partition has a copy of all the data: that’s high-availability.
  2. None of the partitions have a copy of all the data, and the application using the big data store cannot cope with this situation. The system as a whole is non-available.
  3. None of the partitions have a copy of all the data, but the application can cope with missing data. That’s partial availability. This is application dependent as the application has to manage this case explicitly. Most of them won’t be able to handle such a situation.

To put some numbers on all this, let’s have a look at a real implementation. The logic in the math below is to use the most optimistic calculation: in most real-life cases the actual availability would be worse.

Doing the math on HDFS

HDFS, Hadoop Distributed File System. HDFS is a CP big data system: if the system is split into two partitions only one will continue to serve queries.

Its characteristics are:
  • Data is divided in blocks of 128Mb each.
  • Blocks are replicated synchronously to 3 nodes.
  • 1 copy goes to a rack and 2 copies to another.

All this is configurable, but the values mentioned here are commonly used. Let’s imagine we have 10 racks, with 200 Tb of data before replication for the whole cluster. That’s about 1.6 million HDFS blocks of 128 Mb each.

What if there is a network partition at the rack level?

There will be two partitions: one with racks 1 to 9, another with rack 10. With such a partition, the first 9 racks can continue to serve all requests: they have all the data, by construction, as all the blocks written in rack 10 were also copied in another rack. However, a client in rack 10 cannot send requests to these 9 racks, so the system as a whole is highly-available but not CAP-available.

Note that even if rack 10 has all the data, a client in rack 10 would still not be able to do any work locally: a CP system cannot have the two partitions running in parallel (that’s CAP).

What if we have a network partition with 50% of the racks in each partition?

To be highly-available, this partition needs at least one copy of each block. Now, what’s the probability of having at least one copy of all the data if we have only 50% of the racks? To calculate it let’s simplify and say that every block is replicated on 3 different racks. So the probability of having all copies in the other partition is:
  • first copy: 5 bad racks out of 10
  • second copy: 4 bad racks out of 9
  • third copy: 3 bad racks out of 8

This gives us, for a single block: (5/10) * (4/9) * (3/8), i.e. 8%. And so, for a single block, the probability that we have at least one copy in our main partition is around 92%. That’s good, but we have 1.6 million blocks. The probability to have at least one copy of all blocks is 0.921.6 million, which is practically zero. From a CAP categorization point of view, this does not change anything: it is still CP, as CP allows us to fail for some requests on any node. But the system is not highly-available anymore.

Is this a real-life issue? Yes, but there is too much data to copy it everywhere, so we need to pick the fights we can win: here we can support the loss of a node, of a rack, but nothing else.

Impact of not taking the topology into account

Racks require an extra level of configuration and make load-balancing more complicated. But we cannot really avoid it if we want to be tolerant to rack-level partition. For a single rack-loss on a cluster that does not take into account the topology, the same calculation as above shows we would be missing 1/1000 of the data on the first 9 racks. It’s still too much: the topology must be taken into account for all possible partitions you want to support.

An Eventually Consistent Big Data Store

As you can guess, the result is similar as for a CP big data store. Eventual Consistency or not:
  • If the topology is not taken into account, no partition can have all the data.
  • Even if the topology is taken into account, most network partitions will lead to a split between one partition with enough data and another partition without enough data. For these partitions, the system is highly-available, as a CP store, but not CAP-available: the second partition cannot handle all the requests (if any).
  • If the replication between data centers is configured, then for a partition between data centers, and only for this one, the system will be CAP-available.

The math

Looking again at a rack-loss scenario, would an Eventually Consistent Store be better than HDFS? In other words, what is the probability to have at least one copy of all the data in a partitioned rack?
For a single block, the probability that we have at least one copy in our partition is 1 - (9/10) * (8/9) * (7/8), around 30%. So this rack contains 30% of the total data.

In other words, while being eventually consistent would allow a client application to run on this partitioned rack, it will not have access to most of the data, so it will not be available. There is no difference between being strongly consistent and being weakly consistent: most client applications cannot reasonably run on this last rack. Both systems belong to the not-available category of CAP applications.

An Eventually Consistent AP system

While it is impossible for an “eventually consistent big data system” to be AP, it is also difficult for any eventual consistent system to belong to this category. This is because an AP system must:
  • have all the data in all the partitions;
  • continue to support writes during network partition, and this in all partitions.

AP in CAP: it does exist!
AP systems are rare, but they do exist

This is possible for configuration services, naming services and so on. Judging by its documentation, Netflix’s Eureka seems to be close to a real-life AP system. As it’s a service-discovery component, the data size remains small and can fit on any node. Also, they do local-writes-then-resolve-conflict: conflicts on service registrations are likely easy to merge, so the service can be quite reliable. CouchDB is a more generic database that also belongs to this category (Disclaimer: I have not used these systems myself).

However, even when it’s technically possible to build an Eventually Consistent AP system, some applications may have security constraints that will make disconnected operations difficult. There are requirements on “time to revoke access” that can be formulated in minutes for some business applications today. Access rights and authentication applications are a typical example of applications willing to choose “available” over “consistency”, but they are also limited by these “time to revoke access” requirements.

Conclusion and TL;DR

The CAP categorization used in marketing has little to do with what the CAP theorem actually is about. Most systems that describe themselves as CP are not all that not partition-tolerant, and most systems claiming to be AP are not all that available. Here are some quick takeaways:
  • CAP can be used for any distributed system, but the deployment should be specified and taken into account. For example choosing CA makes more sense on an LAN than on a WAN.
  • Availability is not limited to the CAP-Availability. Many real-life applications target high-availability and partial-availability.
  • CA should be used by distributed systems which want to communicate that network partitions are a severe issue (with documentation on what the issue actually is).
  • There is no such thing as an eventually consistent AP big data store.
  • Big data stores are not available under partitions if the network topology is not taken into account. Being eventually consistent does not save you from configuring the rack replication accordingly.

Many points I have raised here have already been mentioned by Eric Brewer in his 2012 paper [C3]. In his 2000 presentation [C1], he already went into the details of partially available data, using the concept of 'harvest'. Also, Daniel Abadi already pointed out in [C8]: “my main problem with CAP is that it focuses everyone on a consistency/availability trade-off, resulting in a perception that the reason why NoSQL systems give up consistency is to get availability. But this is far from the case.“ And, actually, most applications that gave up consistency didn't get much more availability in the end.

Thanks to Nick Dimiduk for his comments on an earlier version of this post. Errors are mine.


[C1] Eric A. Brewer, PODC Keynote, July 19, 2000, Towards Robust Distributed Systems
[C2] Gilbert and Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News (2002)
[C3] Eric A. Brewer:  “CAP Twelve Years Later: How the "Rules" Have Changed”, 2012
[C4] “Perspectives on the CAP Theorem”, Seth Gilbert, Nancy A. Lynch
[C8] Daniel J. Abadi: Consistency Tradeoffs in Modern Distributed Database System Design

No comments:

Post a Comment