Friday, March 13, 2015

Don't use the CAP theorem for node failures

"Dead nodes don't bite." - Robert Louis Stevenson (quoted from memory)



CAP is often described as a theorem you cannot avoid. A common saying is “nodes fail, network packets get lost, partitions arise, so you need CAP to choose the trade-offs.” Actually, the scope of CAP is not that wide. Let’s look at why a process crash or a node failure is not a partition in CAP.

CAP - the usual reminder

CAP is a theorem, initially conjectured by Eric Brewer in 2000 [C1] and proved in 2002 by Seth Gilbert and Nancy Lynch [C2]. Its catchy definition is: “Consistency, Availability, Partition tolerance: pick two.”

Let’s recall the definitions from the proof [C2].
Consistent is: “Atomic, linearizable, consistency [...]. There must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.”

Available is: “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.”

Partition is: “The network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost.”

Node failure vs. partition - the wrong intuition

A common and wrong intuition is: “from the point of view of the remaining nodes, it is impossible to distinguish a node that has failed from a node that is on another partition: in both cases, it neither responds nor sends any message. As a consequence, there is no difference between a network partition and a node failure.”

Let’s test this intuition: first against the definition of CAP, then against its proof, and finally against real-life data store implementations.


CAP theorem: differentiating node failures vs. partition in the real world.
Is this crocodile dead or is it just sleeping?
There is no difference until you try to stroke it.

Checking vs. the CAP definition

The definition of availability requires that “every request received by a non-failing node in the system must result in a response.” A failed node is thus not required to respond: a node failure is, by definition, outside the scope of the CAP theorem.

That’s simple. But maybe the proof proves more than the definition, and implicitly covers the node failure case? Let’s check.

Checking vs. the CAP proof

The proof is described by Gilbert and Lynch in their paper [C2]. They prove that you cannot be both available and consistent during a partition by exhibiting a scenario that cannot satisfy all these properties at once.

The scenario is simple:
Consider a system with two nodes: N1, N2.
step 0: a shared register receives an initial value: SR=0
step 1: N1 is partitioned from N2. By the partition definition, there can be no communication between the two nodes.
step 2: N1 writes a new value for SR: SR=1
step 3: N2 reads SR. It cannot get the latest value written by N1. Hence the system is either not consistent (N2 reads the old value) or not available (the write by N1 fails or the read by N2 fails).
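The steps above can be sketched in code. This is a hypothetical sketch (not from the paper), modeling the partition as a link that drops every message:

```python
# Sketch of the Gilbert-Lynch scenario: two nodes each hold a copy
# of the shared register SR, and the partition drops every message.

class Node:
    def __init__(self):
        self.sr = 0      # step 0: initial value SR=0
        self.links = []  # links to the other nodes

    def write(self, value):
        self.sr = value
        for link in self.links:
            link.send(value)  # replication attempt

class PartitionedLink:
    """Per the CAP partition definition, every message is lost."""
    def send(self, value):
        pass  # message dropped

n1, n2 = Node(), Node()
n1.links = [PartitionedLink()]  # step 1: N1 is partitioned from N2
n1.write(1)                     # step 2: N1 writes SR=1
stale = n2.sr                   # step 3: N2 still sees 0, not 1
```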

This proof relies on having code running on both sides of the partition. If N2 is dead instead of being partitioned, the scenario cannot exist.

The CAP proof relies on having ‘real’ partitions, hence the CAP theorem cannot be applied if a node is simply dead. Does this distinction matter in real life? Let’s check this as well.

Checking vs. the real world

Well, for the real world we just have to copy-paste what Michael Stonebraker said a while ago [C6]: “Now let us turn to single node failures. [...] You simply failover to a replica in a transactionally consistent way. Notably, at least Tandem and Vertica have been doing exactly this for years.”

There is nothing more to add. There are real-world systems that are both available and consistent when there is a node failure.
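As an illustration only (this is a minimal hypothetical sketch of the idea, not Tandem’s or Vertica’s actual mechanism), synchronous replication is enough to survive a single-node failure without giving up consistency:

```python
# Sketch: synchronous replication makes failover transactionally
# consistent -- the standby always holds every acknowledged write,
# so promoting it on a node failure loses nothing.

class Replica:
    def __init__(self):
        self.data = {}
        self.alive = True

class ReplicatedStore:
    def __init__(self):
        self.primary, self.standby = Replica(), Replica()

    def put(self, key, value):
        node = self.primary if self.primary.alive else self.standby
        node.data[key] = value
        # synchronous replication: the other copy gets the write too
        other = self.standby if node is self.primary else self.primary
        if other.alive:
            other.data[key] = value

    def get(self, key):
        node = self.primary if self.primary.alive else self.standby
        return node.data[key]

store = ReplicatedStore()
store.put("k", 1)
store.primary.alive = False  # node failure, not a partition
value = store.get("k")       # failover: still consistent and available
```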

Back to our intuition

So our intuition was wrong. But why exactly? To answer, let’s go back to this specific sentence: “it is impossible to distinguish a node that has failed from a node that is on another partition.” This is true. The issue is:
  • if you want to be available, you will keep running all the nodes on the two partitions. The fact that one partition does not know if the other nodes are dead or partitioned is not a question: they work independently by construction.
  • if you want to be strongly consistent, you need to choose one partition as the available one. The nodes on the other side will kill themselves, or at least will not answer requests. In this case, a partition will be equivalent to a failure, but this equivalence is obtained by implementing specific logic in all the nodes (quorums, masters, …).

This is an implication of the CAP theorem: if we want to cover any kind of operation as CAP does, the only way to be consistent is to kill, logically or physically, all the nodes that are on the wrong side of the partition. In other words, in a strongly consistent system the intuition is right, or, more precisely, actions are taken to make it right. In an AP system, however, the intuition is just wrong.
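As a minimal illustration of that “kill the minority side” logic (a hypothetical sketch, not a full quorum protocol), each node can serve requests only while it can reach a strict majority of the cluster:

```python
# Sketch: a node answers requests only if it can reach a strict
# majority of the cluster (itself included). During a partition the
# minority side goes silent, so it behaves exactly like a dead node.

CLUSTER_SIZE = 3  # say, nodes N1, N2, N3

def can_serve(reachable_count):
    """reachable_count: how many nodes this node can currently reach,
    counting itself."""
    return reachable_count > CLUSTER_SIZE // 2

# Partition {N1, N2} | {N3}:
majority_side = can_serve(2)  # N1 and N2 keep serving, consistently
minority_side = can_serve(1)  # N3 refuses requests: it looks dead
```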

Conclusion

A node failure is not a partition, and the CAP theorem says nothing about failures. As such, it should not be used when reasoning about node failures or server crashes.




References
[C1] Eric A. Brewer, PODC Keynote, July 19, 2000, Towards Robust Distributed Systems
[C2] Gilbert and Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News (2002)
[C6] Michael Stonebraker, Clarifications on the CAP Theorem and Data-Related Errors, 2010

Wednesday, March 11, 2015

Comparing Eventually Consistent and CP-as-in-CAP stores

Eventual Consistency (EC) is a well-known concept. The CAP theorem, which defines Consistency, Availability and Partition tolerance, is also well known. It categorizes distributed systems; a CP system, for example, has two of these properties: Consistency and Partition tolerance.

If we look at data stores just by how they stick to the definition of their consistency model and compare their speeds, which one would be the fastest: an EC or a CP store? Answering this question is a good excuse to go into some details of their respective definitions and to show some of their limits.

Contrary to popular belief, data stores can be very fast

Eventually Consistent

Eventually Consistent was defined by Werner Vogels in [E2]: "the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value."

Let’s try a few implementations of a hypothetical key value store.

Implementation 1: Trying the simplest option

void put(key, value){
  // do nothing
}

value get(key){
  // Hey, it's easy to implement! We're going
  // to pretend that we have not yet received
  // the write, it is enough!
  throw "no value for this key"
}

Is this an EC store? No, it is not: to be an EC store you need to “eventually [...] return the last updated value,” and this implementation will never return the last updated value.

So let’s try another one.

Implementation 2: Trying to be fast for reads

void put(key, value){
  doRealPut(key, value) // does the 'real' job
}

value get(key){
  if (random(2) == 1) // 50% of the time
    throw "no value for this key"
  else
    return doRealGet(key)
    // As, 50% of the time, I pretend
    // that I have not yet received
    // the first insertion, I
    // divide the latency by two!
}

Is this an EC store? Again, it is not because it does not respect the property “eventually all accesses will return the last updated value.” The “all accesses” condition is not met.

However, the definition does not prevent us from doing ridiculous things, such as:

Implementation 3: Playing on words

void put(key, value){
  doRealPut(key, value)
}

value get(key){
  if (currentDate.year < 2020)
    throw "no value for this key"
  else
    return doRealGet(key)
    // I will win all read benchmarks until 2020!
}

This last implementation complies with the definition, but it’s playing on words. At the end of the day, the requirement to return the last value is important: it prevents us from faking availability by inventing default initial values.

Lastly, a kind of hard real-time store, with strict SLA and a “make it in the expected time or stop” policy, would not comply with the EC definition:

Implementation 4: Trying EC for hard real-time systems

void put(key, value){
  doRealPut(key, value)
}

value get(key){
  try (timeLimit = 10 ms) { // run in 10 ms or throw
    return doRealGet(key)
  } catch (TimeOutException) {
    // Max response time is 10 ms!
    throw "no value for this key"
  }
}

This kind of behavior is sometimes useful. However, an EC store is not allowed to do that: it breaks its consistency contract. As with the second implementation, the “eventually all accesses will return the last updated value” promise is broken.
Nevertheless, this behavior can be added on top of any storage, eventually consistent or not. Actually, one can expect that there already is an option to set a similar timeout on many data stores (again, eventually consistent or not).
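For instance (a hypothetical sketch, not the API of any real store), such a time limit can be layered on top of any existing read path:

```python
# Sketch: a "respond within the time limit or give up" policy wrapped
# around an arbitrary get function, using a worker thread and a
# bounded wait. All names here are illustrative.
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def get_with_deadline(real_get, key, limit_s=0.010):
    """Return real_get(key) if it finishes within limit_s seconds,
    otherwise fail fast, like Implementation 4 above."""
    future = _pool.submit(real_get, key)
    try:
        return future.result(timeout=limit_s)
    except concurrent.futures.TimeoutError:
        # Note: the slow read keeps running in the background; a real
        # store would also cancel it.
        raise KeyError("no value for this key")
```

The wrapper only bounds the caller’s wait; the underlying store, eventually consistent or not, is untouched.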

CAP and CP

If you are reading this, you have been exposed to CAP and its catchy definition: “Consistency, Availability, Partition tolerance: pick two”. CAP was first a conjecture, made by Eric Brewer in 2000 [C1]. In 2002, Seth Gilbert and Nancy Lynch proved it [C2].

Let us use the definitions from the proof [C2].
Consistent is: “Atomic, linearizable, consistency [...]. There must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.”

Available is: “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.”

Partition is: “The network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost.”

This lets us write a trivial implementation of a CP system:

Implementation 1: CP made easy

void put(key, value){
  throw "not available"
}

value get(key){
  throw "not available"
}

This “do nothing” implementation does comply with the definition of CP: it is not fully available (actually not available at all), but it is consistent. This is actually mentioned in [C2]: “If availability is not required, then it is easy to achieve atomic data and partition tolerance. The trivial system that ignores all requests meets these requirements.”

CAP and AP

Would it be different for an AP store? Not really: the CAP theorem does not require a minimal consistency for AP. [C2] mentions “It is possible to provide high availability and partition tolerance, if atomic consistency is not required. If there are no consistency requirements, the service can trivially return v0, the initial value, in response to every request.” It means we can use our minimal implementation that was not Eventually Consistent to do an available and partition tolerant store:

Implementation 2: AP is easy too

void put(key, value){
  // do nothing
}

value get(key){
  throw "no value for this key"
}

Is that a useful AP store?

Conclusion

The definition of Eventual Consistency leads to a very clear behavior for any reasonable data store implementation. The definitions of AP and CP in CAP are much more permissive, and allow building useless but perfectly valid CP or AP stores.

An immediate consequence is that any trivial application that crashes permanently when there is a partition gets the CP (partition-tolerance) award.






(Edited version. Thanks a lot to Michael Stack for his comments).

References

[C1] Eric A. Brewer, PODC Keynote, July 19, 2000, Towards Robust Distributed Systems
[C2] Gilbert and Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News (2002)
[E2] Werner Vogels, Eventually Consistent, ACM Queue, Vol. 6 No. 6, 2008

Tuesday, March 10, 2015

The CAP theorem series

Let me introduce a new series of posts on the CAP theorem. CAP is a well known theorem conjectured and proven by recognized researchers in distributed systems, namely Eric Brewer, Seth Gilbert and Nancy Lynch. It is also widely used to categorize distributed applications.


A distributed system to be categorized with the CAP theorem
With CAP, do computer scientists actually have
a better classification tool than biologists?

Why you want to read this series on CAP theorem and databases

This series covers some points about the CAP theorem that are often forgotten or misunderstood, and sometimes never mentioned at all.

The posts, in recommended reading order
  • Comparing Eventually Consistent and CP-as-in-CAP stores: Introduces the definitions of “Consistent” and “Available” in the CAP theorem. By comparing them with the definition of “Eventual Consistency” we show the limits of these definitions.
  • The confusing ACID and CAP wording: Explains the terminologies of ACID and CAP, and how they overlap. Covers as well the difference between the theoretical definition of the ‘I’ in ACID and its actual implementation with multiple “isolation degrees”. It’s a good starting point if you know the “Consistency” definition of ACID but not the “Consistency” definition of CAP.
  • Don't use the CAP theorem for node failures: Looks at a common misunderstanding of the CAP theorem: node failures are not partitions. It’s also a pretext to look at the actual proof of CAP, and to go into the definition of “Partition” in CAP.
  • Don't use the CAP theorem for packet losses: Looks at another common misunderstanding. And it’s also a pretext to look at the network model used in the CAP theorem proof, and to show its difference with the TCP network protocol we know.
  • The unclear CP vs. CA case in CAP: Shows that there are some valid distributed systems that do not fit into any of the CAP categories. Explains what CA really means. Gives some examples of real-life systems, including the two-phase commit with heuristic decisions.
  • CAP, Availability, High-Availability and Big Data Databases in a World of Partitions: CAP and big data: shows that AP is impossible in an eventually consistent big data store. Explains the difference between high-availability and availability as defined by the CAP theorem. Contains a summary table with multiple types of distributed systems, their CAP-categories and their real availability.
  • CAP: if all you have is a timeout, everything looks like a partition: Explains how a ZooKeeper user can take advantage of the difference between a network partition and a process crash.
  • Real-time CAP: Delays can be included in the consistency/availability trade-off if we change the definition of availability to have timing requirements, like in a real-time system.
  • Forfeiting Partition Tolerance in Distributed Systems: Choosing CA makes sense... sometimes.



Copied from the storageMojo blog: courteous comments welcome, of course.