Thursday, March 26, 2015

The confusing CAP and ACID wording

CAP and ACID share a common vocabulary: Atomic, Consistent, and so on. But there is a catch: the words are the same but they mean totally different things. CAP comes from the distributed systems theory, while ACID belongs to database systems one.The  Distributed Databases use both CAP and ACID vocabulary, so this obviously creates a lot of confusion. When someone says: “one should not give up consistency” what’s the consistency he is speaking about? Let’s have a look at the definitions of Atomic-Consistent-Isolated-Durable and Consistent-Available-Partition-tolerant.

ACID & CAP - A reminder

ACID properties were identified in the 70’s, then the term ACID was coined in 1983. We have:
  • A for Atomicity
  • C for Consistency
  • I for for Isolation
  • D for Durability


Reminder: the ACID properties were defined in
California during the 70’s.


The CAP theorem was conjectured by Eric Brewer in 2000, and proved in 2002 by Seth Gilbert and Nancy Lynch. We have:
  • C for Consistency
  • A for Availability
  • P for Partition-tolerance.


CAP and ACID wording - the confusion summarized

First, let’s see the global picture.


Word
Databases
CAP
Confusion
Transaction
A set of operations.
The word and the concept are not used.
No
Durable
Once a transaction completes successfully, its changes to the state survive failures.” [D2]
The word and the concept are not used.
No
Consistent
Integrity constraints on the data (data types, relations, …)
For CAP, Consistency is a shortcut for “Atomic Consistency”. The atomic consistency is a consistency model. More on this later.
Same word, different concepts
Isolation
Even though transactions execute concurrently, it appears to each Transaction, T, that others executed either before either after T”. [D2]
The word is not used in CAP but the word Isolation as defined in ACID is a consistency model in CAP vocabulary.
Different words but same concept
Atomic
All changes happen or none happen.
For CAP, Atomic is a consistency model, the one used in the CAP proof.
Same word, different concepts
Available
Concept not often used. If so, the definition can be different than in CAP, i.e. available may not require all the non-failing nodes to respond.
“every request received by a non-failing node in the system must result in a response” [C2]
Same word, same concept, different definitions
Partition
Concept not often used. If so, the definition is the same as in CAP.
Two sets of nodes are partitioned when all messages between them are lost.
No


Now, let’s dig into some details: we’re going to find out that there are some extra confusion sources for distributed databases.


Transaction (only in ACID)

A transaction is a set of operations. Any of these operations can read or write multiple data. ACID is about giving to this set of operations the same properties as if it was a unique operation. That’s not an objective for CAP. CAP is about the possible properties of multiple operations using the same data, possibly replicated.


Durability (only in ACID)

Once a transaction completes successfully, its changes to the state survive failures” is quite clear, but leaves the failure description to the physical deployment. This one depends mostly on redundancy: multiple disks on a single node and/or multiple nodes and/or multiple sites. “Survive” does not imply any notion of availability: it means that it will be at least possible to restore the data later.


CAP itself does not mention durability. The durability in CAP is implicit: CAP is about partitions, not node failures.


Available-in-CAP

In CAP, being available means all non-failing nodes continue to serve requests if the system is partitioned. Many distributed systems will consider themselves as available if, when there is a partition, some non-failing nodes continue to serve requests. These systems are not available-in-CAP.


Consistency-in-CAP and Atomic-in-CAP

Consistency in CAP is a shortcut for Atomic Consistency. Atomic consistency is a consistency model. And a consistency model describes how the operations of a system can be ordered. The list of operations depends on the system. For example, it’s possible to define the consistency model of a transactional system, hence saying that ‘commit’ is one of the operations. CAP proof is done on a distributed shared memory model defined by Lynch in [V6] and uses read/write/ack.


The choice of a consistency model is far from trivial. There are many consistency models because there are many possible trade-offs between:
  • How easy it is to use the consistency model. It may also depend on the application itself : some consistency models may be easier to use for some applications than for others.
  • How efficient is the implementation of the memory model. This may also depend on the hardware and the physical deployment in general.


The consistency models used in ACID and CAP are actually simple:
  • Sequential consistency, as defined by Lamport [V9]: “the program behaves as if the memory accesses of all processes were interleaved and then executed sequentially.
  • Atomic Consistency (also called linearizability) is sequential plus a real-time constraint: “Unlike sequential consistency, linearizability implicitly assumes the notion of an observable global time across all processes. Operations are modeled by an interval which consists of the period of time between the invocation and response for the operation and each operation is assumed to take effect instantaneously at some point within this interval.” [V7]


CAP says “Consistent” for Atomic Consistency: it’s just a shortcut. Atomic-in-CAP (ordering of operations) and Atomic-in-ACID (all or nothing) are not the same thing at all.


Consistency-in-ACID

Consistency in ACID relates to data integrity. For example, it is often possible to implement these rules in a SQL database:
  • this field is not null
  • this field is a number
  • this field is an existing reference to another field in another table.


The database will not let you commit a transaction that breaks a constraint. That’s the consistency-in-ACID contract. This consistency definition has no equivalent in CAP.


It’s important to remember that a database, SQL or not, does not implement all the consistency constraints. Let’s quote Gray and Reuter: [D2] “the system underneath has no means for checking all the consistency constraints there are. Most of them are not formalized in the first place.


And to conclude on Consistency-in-ACID, let’s quote again Gray and Reuter [D2] “It is important to keep in mind that the consistency definition that comes with the transaction paradigm is largely a syntactic one.


Atomic-in-ACID

The “all-or-nothing” behavior is quite intuitive: for example, with such a transaction, simulating a transfer between two accounts in pseudo-code:


begin
   val1 = read(account1)
   val2 = read(account2)
   newVal1 = val1 - 100
   newVal2 = val2 + 100
   write(account1, newVal1)
   write(account2, newVal2)
commit


Atomic means that either the two accounts will be updated or none of them. If the write to account1 succeeds then the write to account2 fails, the two writes will be rollbacked.


However, being atomic-in-ACID does not mean that this transaction is isolated from the others. In other words, you can claim to be atomic-in-ACID even if:
  • the values written are visible to others before the transaction is actually committed.
  • the values read can be modified by others during the transaction. If you read the same value multiple times you may get different results.


For example, with the transaction proposed above, you can expect this behavior with many SQL databases:
  • start with 1000 on account1, 0 on account2
  • run two transfers in parallel (the code above)
  • while you’re expecting 800 on account1 and 200 on account2, you get 900 on account1.


This because Atomicity is different than isolation in ACID: Atomic/all-or-nothing does not mean isolated.


Isolation

The specification of isolation given by Gray and Reuter in [D2] is: “Even though transactions execute concurrently, it appears to each Transaction, T, that others executed either before either after T”. This defines a consistency model, like the consistency of CAP. With this definition, any transaction is fully isolated. It’s easy to understand and to use.


Isolation in theory: the developer has nothing
to do thanks to the serializability property


However, this was difficult to implement efficiently, and databases had to relax the constraints. As a consequence, isolation in an ACID database comes in several degrees, namely “serializable”, “repeatable read”, “read committed”, and “read uncommitted”.


The one most commonly used by default is “read committed”: a transaction sees only data that has been committed.


While this looks simple, there are a few catches:
  1. These definitions were made with an implementation in mind. As Hellerstein, Stonebraker and Hamilton said [D1] : “Both rely in subtle ways on an assumption that a locking scheme is used for concurrency control, as opposed to an optimistic or multi-version concurrency scheme. This implies that the proposed semantics are ill-defined.
  2. For a given isolation degree, all the databases do not have the same behavior. This is explained by Martin Kleppmann in [D4].
  3. The isolation impacts not only functional correctness, but the technical correctness as well: most databases use locks behind the scene. Concurrent executions can create deadlocks: unexpected or unmanaged dependencies between independent transactions create situations in which all transactions need another transaction to free their own resources. In such a situation, one of the transactions is stopped by the database engine and fails, even if this transaction is functionality and syntactically correct.


At the end of the day, database users need to understand far more than the database consistency model: they need to understand how it is actually implemented by the database engine they are using.


Let’s look at our example above again. If used on a database with “read committed” (a common default), we may have an incorrect result. We may actually generate money. This can be fixed by locking the value explicitly. The pseudo-code for a transaction in “read committed” mode for a database engine using locks could be:


begin
   val1 = readAndLock(account1)
   val2 = readAndLock(account2)
   newVal1 = val1 - 100
   newVal2 = val2 + 100
   write(account1, newVal1)
   write(account2, newVal2)
commit // release all locks


With this comes a lot of the complexity already met in concurrent programming languages like Java. Databases actually add an extra level of complexity as the lock range can be wider (pages, table, …) and can be changed dynamically (escalated) by the database engine. Moreover, deadlocks or performance requirements can lead to use “read uncommitted” for some transactions that are supposed to read only immutable data. This can lead to complex issues if someone, somewhere (let’s say, a professional service expert on a customer’s site) modifies this theoretically immutable data while the system is running.


Isolation in practice, the database role being played by the gnu.


We see that the ‘C’ of CAP and the ‘I’ of ACID are very similar. But CAP being a theorem can stick to the model, while performance constraints forced databases to add multiple levels of parameterization, and forced the database users to understand how the isolation is actually implemented.


Conclusion
Of the 4 letters in the ACID acronym, 3 of them have a different meaning than their equivalent in CAP. No wonder this is confusing. Also, the confusion does not only come from the overlaps in the wording, but in the necessity to go into the implementation details to understand the difference in a real-life concurrent application. In practice, this is one of the difficulties that had led many people (me included) to look at the ‘NoSQL’ world.


See here the Hacker News discussion on this post.





References

[C2] Gilbert and Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News (2002)
[C6] Michael Stonebraker, Clarifications on the CAP Theorem and Data-Related Errors, 2010
[D1] J. M. Hellerstein, M. Stonebraker, J. Hamilton, Architecture of a Database System, Foundations and Trends in Databases Vol. 1, No. 2 (2007) 141–259
[D2] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[D4] M. Kleppmann, “Hermitage: Testing the “I” in ACID”, blog post, 2014
[V6] Nancy Lynch, Distributed Algorithms, Morgan Kaufmann, 1996
[V7] Kourosh Gharachorloo, “Memory Consistency Models for Shared-Memory Multiprocessors,” Stanford Technical Report, 1995
[V8] M. Herlihy, J. Wing “Linearizability: A Correctness Condition for Concurrent Objects”, ACM, 1990
[V9]Leslie Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", IEEE Trans. Comput. C-28,9 (Sept. 1979), 690-691.

3 comments:

  1. Hi Nicolas, thanks for this really good article. I personally think that because of all those differences between the terms (which you've just described), we shouldn't consider ACID in any way when discussing NoSQL. ACID was created back in the day when the systems were hosted on single machines, so they didn't even think about distributed processing and it only confuses people nowadays. I have one question however (since contrary to my opinion ACID is sometimes used when describing NoSQL): do you think that it is possible for NoSQL databases to simultaneously guarantee all of ACIDs properties?

    ReplyDelete
    Replies
    1. Hi Pawel,

      Thanks for the nice feedback.
      To me, using ACID as a binary yes/no is a little bit like CAP categories: it gives a taste of what the database authors are aiming at but not much more.
      Distributed databases with ACID do exist: SQL + 2PC is exactly this, but it comes with so many drawbacks (no real tolerance to partition, limited performances, ...) that it cannot be a general solution. And all distributed database have chosen different drawbacks so far.

      As such, if you want to select or use a distributed database you can't use ACID as a boolean value, but using ACID (and CAP) to study the database behavior helps a lot to identify the trade-offs actually made. For example:

      Atomic:
      - many (Bigtables clones) are atomic by row, and rows have to fit on a single node.
      - Some (voltDB) do more, but then the actual cost of cross-nodes operation can by quite high.

      Consistent:
      - Some constraints are easy to implement (checking data types for example)
      - Foreign key checks are expensive on big data systems. Some try to keep them outside of the DB. There is a paper by Peter Baillis (Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity), with some 'fun' (but expected) findings. (it seems it's not available anymore. As the conf takes place next week, it should be back soon).

      Isolation
      - Between all the default levels + snapshot isolation, which one is efficiently implemented by the DB? Is it implemented with locks all over the place, forfeiting concurrency?

      Durability
      - The wildest one. A lot of NoSQL databases forfeit durability to get better performances: they write on a single node instead of 2 or more or flush the buffer asynchronously or keep the data in the client buffer.

      So the boolean ACID & CAP are very limited but taking each letter individually and asking for each one "What does it mean exactly?" and "what are the drawbacks?" is quite ok to understand a distributed system.

      Delete
  2. Thanks for a very informative article. Given that the definition and context ACID relates to relational databases, should we in fact be more concerned about BASE versus ACID in terms of equivalence?

    ReplyDelete