Sunday, April 5, 2015

The unclear CP vs. CA case in CAP



CAP theorem series
This post is part of the CAP theorem series. You may want to start with my post on ACID vs. CAP if you have a database background but have never really been exposed to the CAP theorem. The post discussing some traps in the ‘Availability’ and ‘Consistency’ definitions of CAP can also serve as an introduction if you know CAP but haven’t looked at its formal definition.


CAP is a theorem, but it is also widely used to categorize distributed systems and to explain their trade-offs. It looks great: “Hey, it’s based on a theorem! This is scientifically proven!”


However, the meaning of these categories--AP, CP, CA--is not that clear, leading to an opposite result: the CAP categorization says very little about how the distributed system actually behaves. For example, look at the CA category--Consistent and Available but not Partition-tolerant. A widely accepted opinion is: “Partitions cannot be avoided. Your distributed system can just choose to sacrifice either consistency or availability when a partition occurs. So a distributed system cannot be CA, by definition.” Indeed, Eric Brewer himself wrote in 2012 [C3]: “Does choosing consistency and availability (CA) as the ‘2 of 3’ make sense? As some researchers correctly point out, exactly what it means to forfeit P is unclear.” Should one use the CA category then?


CA in the CAP theorem a while ago
A slide saying that some systems are CA. Does
its author really understand CAP?


To answer this question, we’re going to use CAP to categorize a few different distributed systems:
  • A simple application using a single server with a traditional SQL database.
  • The same application, but with an Eventually Consistent store instead of our traditional SQL one.
  • An application using multiple databases and the two-phase commit protocol.
  • A distributed consensus server, like Apache ZooKeeper or Google Chubby.


Also, as we know, using CAP in a database context can be confusing, so in this post I always use CAP terminology and not ACID terminology. When I say “consistent” or “strongly consistent”, it always means consistent-as-in-CAP. Same for “available”: it always means available-in-CAP.


CAP - the usual reminder

CAP says that a distributed system cannot be simultaneously Consistent, Available and Partition-tolerant.


Here we will use the definitions from the proof of the CAP theorem by Gilbert and Lynch [C2]. They are important: many CAP misunderstandings come from using very intuitive but very wrong definitions.


Consistent is [C2]: “Atomic, linearizable, consistency [...]. There must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.”


Available is [C2]: “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.”


Partition is [C2]: “The network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost.”


A simple database application and CAP

Let’s look at a very simple application, distributed on two nodes with a traditional SQL database. The (made-up) specification of this application is: “This application allows you to fill and save forms in a web browser. It needs a database to run.” The deployment is simple: on one machine, a web-server; on a second one, a Relational Database Management System (RDBMS, i.e. a traditional database offering ACID properties). Between the two, a network. The web-server reads and writes from the database.


Is this a CP or an AP application? Well, when there is no partition, we are strongly consistent. When there is a partition, we are not available at all, so we remain consistent. As a consequence we are CP.
That “not being available at all is enough to be CP” can be surprising, but is actually mentioned in the proof itself [C2]: “If availability is not required, then it is easy to achieve atomic data and partition tolerance. The trivial system that ignores all requests meets these requirements.”


If you are strongly consistent before the partition, and not available after, you are CP. For this application, a partition does not lead to a choice between ‘C’ and ‘A’ but to an immediate shutdown of the overall service. This, by the CAP theorem, means CP. Obviously, most trivial applications are CP.


CP for the CAP theorem looks like a cat
Most pictures on the Internet are cat pictures. And most
distributed systems you find there are CP.


The same application with an Eventually Consistent store

What is the importance of using an RDBMS in this application?
During a partition, it is not important at all: the application is not available because it cannot access the database, so the consistency model of this database has no importance at this stage.


Before the partition it is important for our classification: if, when there is no partition, the application is not strongly consistent then it cannot claim to be CP. So, if instead of using an RDBMS the application stores its data in an Eventually Consistent store, the system as a whole is not CP. What is it then? Let’s see what we have:
  • CP: no, because the application is not strongly consistent.
  • AP: no, because the application is not available during a partition.
  • CA: no, because the application is not strongly consistent.


So this application is nothing from a CAP point of view. It would still be a valuable application, maybe more fault-tolerant than the one using an RDBMS, but it still wouldn’t fit into any CAP category. Using an Eventually Consistent store but being unavailable during partition puts you out of the CAP categorization.


CAP theorem and categories: some names are missing
Some Distributed Systems Are Not to be Named
(by CAP at least)


CP and ACID-Consistency

Strictly speaking, the reliability added by the database, typically an RDBMS, is not important when looking at the CP category: even if the database is corrupted during the partition and the application cannot be restarted when the partition is fixed, it would still be in the CP “partition-tolerant” category. It is an extreme case, but it shows that using an RDBMS does not impact the “partition-tolerance” of the categorized application. What matters is the behavior before partition and the fact that the application is not available during partition.


Applications using two-phase commit with SQL databases

The objective of the two-phase commit (2PC) protocol is to run a single ACID transaction on multiple nodes, whatever happens (read: failures and partitions). So we have a distributed system with multiple databases and multiple clients. These clients can do simple transactions--going to a single node--or distributed transactions--going to multiple nodes. Transactions spanning multiple nodes use the two-phase commit protocol.


Here is a simplified description of how the two-phase commit works:
  • the transaction starts on multiple nodes. One of the nodes is the coordinator.
  • during the first phase, the coordinator asks all nodes if they can commit the transaction. This means they have the resources available (enough disk space, for example), that all the integrity constraints are valid, and so on. All the nodes must send their answer back. If they say “yes” at this stage, they cannot change their minds later, so they must keep all the resources available, locks included. Again: they are not allowed to fail or roll back once they have replied “yes” to the coordinator.
  • if one of the nodes says “no”, the coordinator asks for a rollback on all nodes and we are done.
  • if all the nodes say “yes”, the coordinator asks all nodes to commit their part of the transaction and we are done.
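The steps above can be sketched as a minimal in-memory simulation. The `Participant` class and its methods are hypothetical illustrations, not a real transaction-manager API:

```python
# A sketch of a two-phase commit coordinator (hypothetical Participant API).
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: once a participant votes "yes" it must hold its locks
        # and resources until it hears the coordinator's decision.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: ask every node whether it can commit.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit everywhere only if every vote was "yes".
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled back"


nodes = [Participant("A"), Participant("B", can_commit=False)]
print(two_phase_commit(nodes))  # rolled back: one "no" vote aborts everyone
```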


Two-phase commit with partition

That looks simple. But what happens if there is a partition between the database nodes during the commit phase? Then one or more of the nodes may not receive the final decision from the coordinator. Such a node can choose between holding all locks and resources (i.e. being not available) or taking a local decision. This local decision is called a “heuristic decision”: the node either commits locally or rolls back locally, but it does something, and does it independently from the other nodes.
In CAP terms, heuristic decisions are there to improve availability: instead of blocking indefinitely, the node continues. But there is no free lunch, and this increased availability comes with a drawback: different nodes can take different decisions. One node can commit its half of the transaction while the other rolls back its own half. For example, if you are using two-phase commit to book a single round-trip ticket with two different airline companies, you may end up with only half of the trip booked.
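What a partition during phase 2 can produce may be sketched like this (the node names and the per-node heuristic policy are made up for illustration):

```python
# Sketch: nodes that missed the coordinator's phase-2 decision apply
# their own local heuristic policy instead.
def finish_after_partition(prepared_nodes, decision, reachable, heuristic):
    # decision: the coordinator's verdict ("commit" or "rollback").
    # reachable: nodes that actually received it.
    # heuristic: what an isolated node decides on its own.
    return {node: (decision if node in reachable else heuristic(node))
            for node in prepared_nodes}

outcome = finish_after_partition(
    prepared_nodes=["A", "B"],
    decision="commit",
    reachable={"A"},                    # B is partitioned away
    heuristic=lambda node: "rollback",  # B's local policy
)
print(outcome)  # {'A': 'commit', 'B': 'rollback'}: atomicity is broken
```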


This is a known issue. For example, the JBoss documentation says: “It is one of the worst errors that may happen in a transaction system, as it can lead to parts of the transaction being committed while other parts are rolled back, thus violating the atomicity property of transaction and possibly leading to data integrity corruption.” As this documentation says, we broke the Atomicity-in-ACID property of the transaction. In our example, we can also consider that we broke Consistency-in-ACID: a round-trip ticket should have two flights attached, but we have only one.


So heuristic decisions in the two-phase commit increase the availability by making the transaction non-ACID. But what are we from a CAP point of view? Are we CP or AP?


Categorizing the two-phase commit

The easy part first: any operation involving the two nodes is no longer possible. So, obviously, the system is not available-as-in-CAP: some (but not all, and that’s the whole point) operations will fail, and the ‘A’ of CAP does not allow this. So this system is not AP.


Are we CP? No, because some scenarios will show an incoherent history. For example, in our flight application, we may imagine a single transaction doing this:
  • store the first flight in node A.
  • store the second flight in node B.
  • store the whole ticket, with references to the two flights, in node A.
If node A commits and node B rolls back, node A sees a history that does not exist: the second flight existed and then disappeared. That’s not consistent-in-CAP.


So we’re not AP, and we’re not CP. We now have the choice between “Is Not To Be Named” and CA. Let’s try CA first.


The unclear case for CA

What about using CA here? That was the point of view of the author of the slide used in the introduction. Here it is again:


Brewer's original presentation on CA: forfeit partition
Slide from Eric Brewer’s presentation
of the CAP conjecture in 2000


We know that we cannot avoid network partition. What if we specify our application like this: “this application does not handle network partition. If it happens, the application will be partly unavailable, the data may be corrupted, and you may have to fix the data manually.” In other words, we’re really asking to be CA here, but if a partition occurs we may be CP, or, if we are unlucky, both not available and not consistent.


Saying this implies that CA and CP/AP are not about the same face of a distributed system:
  • CA is a specification of the operating range: you specify that the system does not work well under partition or, more precisely, that partitions are outside the operating range of the system.
  • CP or AP describes the behavior: what happens if there is a partition.


We could draw a parallel with the specification of software. Let’s imagine a simple application written and tested with Postgres, and then this dialogue between a software vendor and a customer:


Here is the application you paid for. It works on Postgres.




Thanks, but I’m running Oracle here.
We don’t have a license for Oracle. We can’t support you on this platform. It’s clearly specified, see page 327 of appendix E.




(time goes by)



FYI: I did some tests on Oracle, it seems ok. It goes in production tomorrow. I will let you know if I have any issues.

But we won’t be able to fix them!




This you will have to discuss with our lawyers. But I tested many cases, I’m confident it works well enough.


At the end of the day, does this application work on Oracle or not?
You can look at software by its specification or by its actual behavior. They sometimes diverge. And the specified operating range may be respected or not. Depending on the importance of the application, you may offer a degraded mode, or a repair tool, or go crazy, or hope for the best, or whatever.


In the case of a distributed system, you can specify that you don’t handle network partitions well. That does not prevent network partitions from happening; it just says that a network partition could break some or all of the software’s promises.


From this point of view, the two-phase commit is CA: it tries to minimize the probability that a partition will have an impact on the system. However, if one does happen, the behavior is documented.


CA and CP overlap because they do not describe the same face of a distributed system. CA specifies an operating range. CP/AP describes what happens during partition, even if partitions are outside the operating range.


CA in CAP: the world where you can dodge partitions
The CA category does not belong to the same world
as the CP and AP ones.


Distributed Consensus Server

A typical example of such an application is Chubby [V4]. Consensus systems are typically used to make distributed nodes agree on a single value for a piece of data. These nodes can fail or be partitioned.


Chubby has the following characteristics:
  • It contains 5 nodes.
  • A master is elected. It must get at least 3 votes: this is called the quorum.
  • All reads and writes go through the master.
  • If the master fails, a new election takes place.
  • Clients can go to any node. If they go to a node that is not the master, this node will redirect them to the master.


This is a simplified view: for example, there can be fewer or more nodes. The heart of this mechanism is the vote. Chandra et al. proved that these systems work and will continue to work if there is a partition or a crash of a minority of servers [V3]: the master can be elected if there are 3 voters, so the protocol tolerates the failure or the partition of 2 nodes.
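The quorum arithmetic behind this can be written out in a few lines (a sketch of the majority rule, not of any real Chubby or ZooKeeper code):

```python
# Majority-quorum arithmetic for a consensus ensemble.
def quorum(n):
    # Smallest majority of n voters.
    return n // 2 + 1

def tolerated_failures(n):
    # Nodes that can fail or be partitioned away while the
    # remaining majority can still elect a master.
    return n - quorum(n)

n = 5
print(quorum(n))              # 3: votes needed to elect a master
print(tolerated_failures(n))  # 2: nodes that can be lost
```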


Is this available? No, because of the definition of availability in [C2]: “every request received by a non-failing node in the system must result in a response.” If there is a partition, some nodes will be non-failing, but will not be able to fulfill any query they receive. CAP is about being fully available: all nodes must continue to serve all queries. CAP does not make any difference between responding on some nodes and not responding at all: these two system types belong to a single CAP category, CP.


CP category can be very nice
Some cats are more expensive than
others, but they are still cats.


Conclusion

Using the CAP theorem to categorize CA and CP distributed systems shows:
  • CAP lacks some categories: some valid systems may be “non-consistent-in-CAP” and “non-available-in-CAP”, especially because of the highly demanding definition of availability in CAP.
  • Partition does not always imply a binary choice between availability and consistency. Other choices are possible (Eric Brewer said this already [C3]). However, these choices put you out of the CP/AP categories, because Availability-in-CAP means “all nodes” and “all requests”.
  • Using CA can be “not clear” but is not ridiculous. I’m personally happy to consider it as the specification of an operating range, with CP/AP being the description of a behavior.
  • CP catches more than it should. It catches:
    • Applications “consistent” and “partly available” during a partition.
    • Applications “consistent” but “not available” at all during a partition. One could consider that these applications are actually CA: they could specify that they do not handle partitions.
    • Applications breaking their “consistency-in-ACID” constraints during or after a partition. One could consider that these applications are actually CA: that’s the two-phase commit example.


This is a partial conclusion. The next post is about the AP category.






Thanks to Marc Shapiro for his comments on an earlier version of this post. Errors are mine.

References

[C1] Eric A. Brewer, Towards Robust Distributed Systems, PODC Keynote, July 19, 2000
[C2] Gilbert and Lynch, Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News (2002)
[C3] Eric A. Brewer, “CAP Twelve Years Later: How the ‘Rules’ Have Changed”, 2012
[V3] T. D. Chandra and S. Toueg, Unreliable failure detectors for reliable distributed systems, Journal of the ACM, 43(2):225–267, 1996
[V4] Mike Burrows, The Chubby lock service for loosely-coupled distributed systems, OSDI, 2006

Thursday, March 26, 2015

The confusing CAP and ACID wording

CAP and ACID share a common vocabulary: Atomic, Consistent, and so on. But there is a catch: the words are the same but they mean totally different things. CAP comes from distributed systems theory, while ACID belongs to database systems theory. Distributed databases use both CAP and ACID vocabulary, so this obviously creates a lot of confusion. When someone says “one should not give up consistency”, which consistency is he speaking about? Let’s have a look at the definitions of Atomic-Consistent-Isolated-Durable and Consistent-Available-Partition-tolerant.

ACID & CAP - A reminder

ACID properties were identified in the 70’s, then the term ACID was coined in 1983. We have:
  • A for Atomicity
  • C for Consistency
  • I for Isolation
  • D for Durability


Reminder: the ACID properties were defined in
California during the 70’s.


The CAP theorem was conjectured by Eric Brewer in 2000, and proved in 2002 by Seth Gilbert and Nancy Lynch. We have:
  • C for Consistency
  • A for Availability
  • P for Partition-tolerance.


CAP and ACID wording - the confusion summarized

First, let’s see the global picture.


| Word | Databases (ACID) | CAP | Confusion |
|---|---|---|---|
| Transaction | A set of operations. | The word and the concept are not used. | No |
| Durable | “Once a transaction completes successfully, its changes to the state survive failures.” [D2] | The word and the concept are not used. | No |
| Consistent | Integrity constraints on the data (data types, relations, …) | For CAP, Consistency is a shortcut for “Atomic Consistency”. Atomic consistency is a consistency model. More on this later. | Same word, different concepts |
| Isolation | “Even though transactions execute concurrently, it appears to each Transaction, T, that others executed either before or after T.” [D2] | The word is not used in CAP, but Isolation as defined in ACID is a consistency model in CAP vocabulary. | Different words but same concept |
| Atomic | All changes happen or none happen. | For CAP, Atomic is a consistency model, the one used in the CAP proof. | Same word, different concepts |
| Available | Concept not often used. If so, the definition can differ from CAP’s, i.e. available may not require all the non-failing nodes to respond. | “Every request received by a non-failing node in the system must result in a response.” [C2] | Same word, same concept, different definitions |
| Partition | Concept not often used. If so, the definition is the same as in CAP. | Two sets of nodes are partitioned when all messages between them are lost. | No |


Now, let’s dig into some details: we’re going to find out that there are some extra confusion sources for distributed databases.


Transaction (only in ACID)

A transaction is a set of operations. Any of these operations can read or write multiple data. ACID is about giving to this set of operations the same properties as if it was a unique operation. That’s not an objective for CAP. CAP is about the possible properties of multiple operations using the same data, possibly replicated.


Durability (only in ACID)

“Once a transaction completes successfully, its changes to the state survive failures” is quite clear, but leaves the failure description to the physical deployment. This one depends mostly on redundancy: multiple disks on a single node and/or multiple nodes and/or multiple sites. “Survive” does not imply any notion of availability: it means that it will be at least possible to restore the data later.


CAP itself does not mention durability. The durability in CAP is implicit: CAP is about partitions, not node failures.


Available-in-CAP

In CAP, being available means all non-failing nodes continue to serve requests if the system is partitioned. Many distributed systems will consider themselves as available if, when there is a partition, some non-failing nodes continue to serve requests. These systems are not available-in-CAP.


Consistency-in-CAP and Atomic-in-CAP

Consistency in CAP is a shortcut for Atomic Consistency. Atomic consistency is a consistency model, and a consistency model describes how the operations of a system can be ordered. The list of operations depends on the system. For example, it’s possible to define the consistency model of a transactional system, hence saying that ‘commit’ is one of the operations. The CAP proof is done on a distributed shared-memory model defined by Lynch in [V6] and uses read/write/ack operations.


The choice of a consistency model is far from trivial. There are many consistency models because there are many possible trade-offs between:
  • How easy the consistency model is to use. This may also depend on the application itself: some consistency models may be easier to use for some applications than for others.
  • How efficiently the consistency model can be implemented. This may also depend on the hardware and the physical deployment in general.


The consistency models used in ACID and CAP are actually simple:
  • Sequential consistency, as defined by Lamport [V9]: “the program behaves as if the memory accesses of all processes were interleaved and then executed sequentially.”
  • Atomic Consistency (also called linearizability) is sequential plus a real-time constraint: “Unlike sequential consistency, linearizability implicitly assumes the notion of an observable global time across all processes. Operations are modeled by an interval which consists of the period of time between the invocation and response for the operation and each operation is assumed to take effect instantaneously at some point within this interval.” [V7]
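As a toy illustration of the difference between the two models, assuming a single read/write register and a brute-force search over orderings, one can check a tiny two-operation history against both definitions:

```python
from itertools import permutations

# Each operation: (process, kind, value, start_time, end_time).
history = [
    ("P1", "write", 1, 0, 2),  # write x=1, completes at t=2
    ("P2", "read",  0, 3, 4),  # read returns 0, starts after the write ended
]

def legal(seq):
    # Replay a register starting at 0; each read must return the current value.
    x = 0
    for (_, kind, val, _, _) in seq:
        if kind == "write":
            x = val
        elif x != val:
            return False
    return True

def respects_realtime(seq):
    # A must precede B whenever A ends before B starts (linearizability).
    pos = {op: i for i, op in enumerate(seq)}
    return all(pos[a] < pos[b] for a in seq for b in seq
               if a is not b and a[4] < b[3])

def respects_program_order(seq):
    # Only per-process order must be kept (sequential consistency).
    pos = {op: i for i, op in enumerate(seq)}
    return all(pos[a] < pos[b] for a in seq for b in seq
               if a is not b and a[0] == b[0] and a[3] < b[3])

sequentially_consistent = any(legal(s) and respects_program_order(s)
                              for s in permutations(history))
linearizable = any(legal(s) and respects_realtime(s)
                   for s in permutations(history))

print(sequentially_consistent)  # True: the read can be ordered before the write
print(linearizable)             # False: real time forces the write first
```

The read of a stale value is acceptable under sequential consistency, because the read can be reordered before the write; linearizability forbids it, because in real time the write had already completed.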


CAP says “Consistent” for Atomic Consistency: it’s just a shortcut. Atomic-in-CAP (ordering of operations) and Atomic-in-ACID (all or nothing) are not the same thing at all.


Consistency-in-ACID

Consistency in ACID relates to data integrity. For example, it is often possible to implement these rules in a SQL database:
  • this field is not null
  • this field is a number
  • this field is an existing reference to another field in another table.


The database will not let you commit a transaction that breaks a constraint. That’s the consistency-in-ACID contract. This consistency definition has no equivalent in CAP.
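For instance, with SQLite (any RDBMS behaves similarly; the table names here are made up), a foreign-key constraint makes the engine reject a write that would break referential integrity:

```python
import sqlite3

# Consistency-in-ACID in action: the database refuses a change that
# breaks a declared integrity constraint.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE flights (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE tickets (
    id INTEGER PRIMARY KEY,
    flight_id INTEGER NOT NULL REFERENCES flights(id))""")

conn.execute("INSERT INTO flights VALUES (1)")
conn.execute("INSERT INTO tickets VALUES (1, 1)")  # valid reference: accepted

try:
    conn.execute("INSERT INTO tickets VALUES (2, 99)")  # no such flight
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the engine refuses the write
```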


It’s important to remember that a database, SQL or not, does not enforce all the consistency constraints. Let’s quote Gray and Reuter [D2]: “the system underneath has no means for checking all the consistency constraints there are. Most of them are not formalized in the first place.”


And to conclude on Consistency-in-ACID, let’s quote Gray and Reuter again [D2]: “It is important to keep in mind that the consistency definition that comes with the transaction paradigm is largely a syntactic one.”


Atomic-in-ACID

The “all-or-nothing” behavior is quite intuitive. For example, consider a transaction simulating a transfer between two accounts, in pseudo-code:


begin
   val1 = read(account1)
   val2 = read(account2)
   newVal1 = val1 - 100
   newVal2 = val2 + 100
   write(account1, newVal1)
   write(account2, newVal2)
commit


Atomic means that either the two accounts are updated or none of them is. If the write to account1 succeeds but the write to account2 fails, both writes are rolled back.


However, being atomic-in-ACID does not mean that this transaction is isolated from the others. In other words, you can claim to be atomic-in-ACID even if:
  • the values written are visible to others before the transaction is actually committed.
  • the values read can be modified by others during the transaction. If you read the same value multiple times you may get different results.


For example, with the transaction proposed above, you can expect this behavior with many SQL databases:
  • start with 1000 on account1, 0 on account2
  • run two transfers in parallel (the code above)
  • while you’re expecting 800 on account1 and 200 on account2, you get 900 on account1.


This is because atomicity is different from isolation in ACID: atomic/all-or-nothing does not mean isolated.
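The anomaly above can be simulated in a few lines of plain Python, with no isolation at all: both transactions read the same snapshot of the accounts, and the last writer wins.

```python
# Lost-update simulation: two transfers read the balances before
# either one writes them back.
accounts = {"account1": 1000, "account2": 0}

def transfer(snapshot, amount):
    # Each "transaction" works on the values it read at start time.
    return snapshot["account1"] - amount, snapshot["account2"] + amount

# Both transactions read the same initial state (no isolation).
snap1 = dict(accounts)
snap2 = dict(accounts)

new1 = transfer(snap1, 100)
new2 = transfer(snap2, 100)

# Both commit; the second write silently overwrites the first.
accounts["account1"], accounts["account2"] = new1
accounts["account1"], accounts["account2"] = new2

print(accounts)  # {'account1': 900, 'account2': 100}, not the expected 800/200
```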


Isolation

The specification of isolation given by Gray and Reuter in [D2] is: “Even though transactions execute concurrently, it appears to each Transaction, T, that others executed either before or after T”. This defines a consistency model, like the consistency of CAP. With this definition, any transaction is fully isolated. It’s easy to understand and to use.


Isolation in theory: the developer has nothing
to do thanks to the serializability property


However, this was difficult to implement efficiently, and databases had to relax the constraints. As a consequence, isolation in an ACID database comes in several degrees, namely “serializable”, “repeatable read”, “read committed”, and “read uncommitted”.


The one most commonly used by default is “read committed”: a transaction sees only data that has been committed.


While this looks simple, there are a few catches:
  1. These definitions were made with an implementation in mind. As Hellerstein, Stonebraker and Hamilton said [D1]: “Both rely in subtle ways on an assumption that a locking scheme is used for concurrency control, as opposed to an optimistic or multi-version concurrency scheme. This implies that the proposed semantics are ill-defined.”
  2. For a given isolation degree, databases do not all have the same behavior. This is explained by Martin Kleppmann in [D4].
  3. Isolation impacts not only functional correctness but technical correctness as well: most databases use locks behind the scenes. Concurrent executions can create deadlocks: unexpected or unmanaged dependencies between independent transactions leave each one waiting for a resource another holds. In such a situation, the database engine stops one of the transactions, which then fails even though it is functionally and syntactically correct.


At the end of the day, database users need to understand far more than the database consistency model: they need to understand how it is actually implemented by the database engine they are using.


Let’s look at our example above again. If used on a database with “read committed” (a common default), we may have an incorrect result. We may actually generate money. This can be fixed by locking the value explicitly. The pseudo-code for a transaction in “read committed” mode for a database engine using locks could be:


begin
   val1 = readAndLock(account1)
   val2 = readAndLock(account2)
   newVal1 = val1 - 100
   newVal2 = val2 + 100
   write(account1, newVal1)
   write(account2, newVal2)
commit // release all locks


With this comes a lot of the complexity already met in concurrent programming languages like Java. Databases actually add an extra level of complexity, as the lock range can be wider (pages, table, …) and can be changed dynamically (escalated) by the database engine. Moreover, deadlocks or performance requirements can lead to using “read uncommitted” for transactions that are supposed to read only immutable data. This can cause complex issues if someone, somewhere (say, a professional services expert on a customer site) modifies this theoretically immutable data while the system is running.


Isolation in practice, the database role being played by the gnu.


We see that the ‘C’ of CAP and the ‘I’ of ACID are very similar. But CAP, being a theorem, can stick to its model, while performance constraints forced databases to add multiple levels of parameterization, and forced database users to understand how isolation is actually implemented.


Conclusion
Of the four letters in the ACID acronym, three have a different meaning than their equivalent in CAP. No wonder this is confusing. And the confusion comes not only from the overlaps in wording, but also from the need to dig into implementation details to understand the differences in a real-life concurrent application. In practice, this is one of the difficulties that has led many people (me included) to look at the ‘NoSQL’ world.


See here the Hacker News discussion on this post.





References

[C2] Gilbert and Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News (2002)
[C6] Michael Stonebraker, Clarifications on the CAP Theorem and Data-Related Errors, 2010
[D1] J. M. Hellerstein, M. Stonebraker, J. Hamilton, Architecture of a Database System, Foundations and Trends in Databases Vol. 1, No. 2 (2007) 141–259
[D2] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[D4] M. Kleppmann, “Hermitage: Testing the “I” in ACID”, blog post, 2014
[V6] Nancy Lynch, Distributed Algorithms, Morgan Kaufmann, 1996
[V7] Kourosh Gharachorloo, “Memory Consistency Models for Shared-Memory Multiprocessors,” Stanford Technical Report, 1995
[V8] M. Herlihy, J. Wing “Linearizability: A Correctness Condition for Concurrent Objects”, ACM, 1990
[V9] Leslie Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs”, IEEE Trans. Comput. C-28,9 (Sept. 1979), 690–691.