Monday, May 25, 2015

CAP: if all you have is a timeout, everything looks like a partition

“I suppose it is tempting, if the only tool you have is a timeout, to treat everything as if it were a partition.” - Abraham Maslow, quoted from memory

CAP Theorem series
This post is part of the CAP theorem series. You may want to start with my post on ACID vs. CAP if you have a database background but have never really been exposed to the CAP theorem. The post discussing some traps in the ‘Availability’ and ‘Consistency’ definitions of CAP is also a useful introduction if you know CAP but haven’t looked at its formal definition.

Partitions in the CAP theorem are network partitions, and nothing else. They are not node failures or process crashes. But does this matter in real life? Let’s answer this question by looking at an optimisation implemented in Apache HBase and Apache ZooKeeper for handling process crashes. It is applicable to anyone using ZooKeeper.

Apache HBase, ZooKeeper and failure detection

From the ZooKeeper website: “ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.”

And from the HBase website: “Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.” HBase is an open source implementation of Google’s Bigtable. Cluster sizes vary from 5 to a few thousand nodes.

ZooKeeper is used by HBase for a few things, one of them being, as in Bigtable, failure detection. ZooKeeper handles failures by letting the user create “ephemeral znodes”: “These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted.” And then: “Expirations happen when the cluster does not hear from the client within the specified session timeout period (i.e. no heartbeat). At session expiration the cluster will delete any/all ephemeral nodes owned by that session and immediately notify any/all connected clients of the change (anyone watching those znodes).”

Ephemeral znodes are a key element of many ZooKeeper use cases, such as distributed locks. The timeout means that if the process holding the lock crashes or is partitioned, the lock will be freed only after the session timeout expires--typically around 20 seconds.
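The expiry mechanics can be sketched in a few lines. This is a toy model, not the ZooKeeper implementation: the class and method names (`Ensemble`, `create_ephemeral`, `expire_sessions`) are invented for illustration.

```python
class Ensemble:
    """Toy model of ZooKeeper-style session expiry (hypothetical names)."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}   # session id -> time of last heartbeat
        self.ephemerals = {}       # znode path -> owning session id

    def heartbeat(self, session, now):
        self.last_heartbeat[session] = now

    def create_ephemeral(self, session, path, now):
        self.heartbeat(session, now)
        self.ephemerals[path] = session

    def expire_sessions(self, now):
        # Delete all ephemeral znodes owned by sessions that stayed silent
        # longer than the session timeout.
        dead = {s for s, t in self.last_heartbeat.items()
                if now - t > self.session_timeout}
        self.ephemerals = {p: s for p, s in self.ephemerals.items()
                           if s not in dead}
        for s in dead:
            del self.last_heartbeat[s]

zk = Ensemble(session_timeout=20)
zk.create_ephemeral("session-1", "/locks/my-lock", now=0)
zk.expire_sessions(now=10)   # still within the timeout: the lock is held
assert "/locks/my-lock" in zk.ephemerals
zk.expire_sessions(now=25)   # silent for 25s > 20s: the lock is freed
assert "/locks/my-lock" not in zk.ephemerals
```

Whatever killed the client (crash, partition, long GC pause), the ensemble sees the same thing: silence, then expiry after the full timeout.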

HBase uses ZooKeeper and ephemeral znodes for failure detection (though the exclusion mechanism itself is not in ZooKeeper, as described in [V10]). The ZooKeeper timeout represents the major part of the MTTR in a failure scenario, and this for any type of failure: process crash, node failure, or network partition.

A tortoise can be fast--sometimes.

Network partition compared to process crash

There are two issues with a network partition. First, it is complex to detect. Second, by definition, you don’t know what’s going on on the other side of the partition: the other side can be dead (then the CAP theorem does not apply), or not (then you have to choose between consistency and availability).

If all you have is a timeout, then you have no information about what happened: the process might be dead, might be partitioned for good… or might already be back, alive and kicking. A ‘partition-tolerant’ application has to expect the worst.

But a process crash is much simpler than a partition:
  • The node itself is still there, hence it can communicate the death of the process immediately.
  • Since the process is dead, there is no risk of a split-brain situation.

So if we can distinguish between the two cases, we have an easy improvement available.

Implementation details

There is a simple implementation here: HBase, like many Unix applications, is launched by a shell script. The last line of the script launches the process, so we just have to add one final line to the script asking ZooKeeper to delete the ephemeral znode. That’s what the HBase scripts implement.

So simple?

Nearly. The script part is a little more complex than what I just described, because the processes are launched as background tasks. So the work in the script is on the order of 20 lines rather than just one.
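In spirit, the extra work amounts to a small wrapper around the process: wait for it to die, then delete the znode. Here is a minimal Python sketch of the idea (the real HBase code is a shell script; `run_and_cleanup`, `delete_znode` and the znode path are invented names):

```python
import subprocess
import sys

def run_and_cleanup(cmd, delete_znode):
    """Launch a process and, as soon as it exits (crash included),
    release its ephemeral znode instead of waiting for the timeout."""
    proc = subprocess.Popen(cmd)
    proc.wait()        # blocks until the process dies, for any reason
    delete_znode()     # immediate notification to the rest of the cluster
    return proc.returncode

released = []
rc = run_and_cleanup([sys.executable, "-c", "print('working...')"],
                     delete_znode=lambda: released.append("/hbase/rs/node1"))
assert rc == 0 and released == ["/hbase/rs/node1"]
```

The key property is that the cleanup runs on the same node as the process: it works for any exit, clean or not, but not for a node failure or a partition, where the timeout remains the only safety net.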

More importantly, it also means that the software watching for the node timeout must have an API allowing an external update of a node’s status. It’s the case with ZooKeeper, via this ephemeral znode deletion, even if it’s a little hacky: it would be cleaner to expire the session instead of deleting a fixed znode.


ZOOKEEPER-922 is a very interesting change request. Its description is perfectly clear, so I quote it here: “In the case when a client connection is closed due to socket error instead of the client calling close explicitly, it would be nice to enable the session associated with that client to timeout faster than the negotiated session timeout. This would enable a zookeeper ensemble that is acting as a dynamic discovery provider to remove ephemeral nodes for crashed clients quickly, while allowing for a longer heartbeat-based timeout for java clients that need to do long stop-the-world GC.” Such socket errors are unambiguous: if the socket is closed while the remote node is still up, you do get immediate and (reasonably) clear exceptions. So this approach also leverages the fact that the node is still up.

That’s exactly what we want. But the change request also contains an analysis of the possible solutions:
  • “How do you deal with the following race condition?”
  • “There may be a way to do this reliably, but we need to vet the design first.”
  • “What happens if the client disconnects from one server and moves to another?”

This is CAP in action. A closed socket does not imply that the remote process is dead. Maybe the process is dead--that’s the case you want to optimize for--but you’re not sure, and sometimes it’s not dead (here, because the ZooKeeper client can try to connect to another node). The analysis is complex, error-prone and time-consuming. The ZooKeeper team tends to hold off committing when in doubt, and this change request is still open.

Is adding this extra case for process crash useful?

If your processes never crash, and are never killed by the operating system or by the administration team as a last resort, it’s not very useful. Also, if having walking-dead nodes in the cluster is not an issue, early death detection does not help much--this is more a CP thing than an AP one. But remember that most eventually consistent databases around have “tunable consistency” and can be used in a CP mode.
Interestingly, both the Google Bigtable and Apache HBase teams added this independently to their implementations--it is not present in the original Bigtable paper. At Facebook--a large HBase user--it is also reported as being very useful when doing a rolling upgrade: if anything goes wrong, it’s possible to kill -9 the suspicious processes without impacting availability. Immediate notification, hence immediate recovery, simplifies system administration. [V12] also mentions “a technique that does involve killing, which is used to increase the availability of Web servers and other services, is to deploy a local script that periodically checks if the application process is running. If not, or if the process has erratic behavior (such as very high CPU usage), the script restarts the application, killing it first if necessary.” This technique can be generalized if a crash is detected immediately.


Consensus servers, like ZooKeeper, rely on “unreliable failure detectors”, as defined by Chandra and Toueg in 1996 [V3]. The simplest and most common design for these failure detectors is timeout-based: that’s the one ZooKeeper comes with. But there are a lot of performance benefits to be found in improving these failure detectors. Here we saw a cheap crash detector, but some network equipment can help as well [V11], with callbacks allowing integration with your failure detector. Falcon [V12], “coordinating a network of spies, each monitoring a layer of the system”, is also an interesting approach. Once again, this means that openness is a key requirement for a failure detector.
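The “openness” point can be made concrete with a sketch: a timeout-based detector that can also be short-circuited by crash reports from external sources (a local script, network equipment, a Falcon-style spy). All names below are invented for illustration; this is not the ZooKeeper or Falcon API.

```python
class HybridFailureDetector:
    """Timeout-based suspicion, short-circuited by external crash reports."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}         # node -> time of last heartbeat
        self.reported_dead = set()  # nodes declared dead by an external source

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def report_crash(self, node):
        # Called by an external source of truth, e.g. the wrapper script
        # that notices the process has exited.
        self.reported_dead.add(node)

    def is_suspected(self, node, now):
        if node in self.reported_dead:
            return True   # immediate: no need to wait for the timeout
        return now - self.last_seen.get(node, now) > self.timeout

fd = HybridFailureDetector(timeout=20)
fd.heartbeat("node1", now=0)
assert not fd.is_suspected("node1", now=10)  # timeout not reached yet
fd.report_crash("node1")                     # external, immediate information
assert fd.is_suspected("node1", now=10)      # suspected 10s early
```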

Keep in mind that the ‘any failure is a partition’ motto hides many possible improvements. Distinguishing between the different cases and using all available information from all possible sources is key to improving system availability.



[V3] T. D. Chandra and S. Toueg, “Unreliable failure detectors for reliable distributed systems”, Journal of the ACM, 43(2):225–267, 1996
[C7] Henry Robinson, “FAQ on the CAP theorem”, 2013
[V10] D. Borthakur et al., “Apache Hadoop Goes Realtime at Facebook”, SIGMOD 2011
[V11] Benoit Sigoure, “Fast Server Failover for Hadoop & HBase”, HBase meetup, 2013
[V12] Leners et al., “Detecting failures in distributed systems with the FALCON spy network”, SOSP ’11

Tuesday, May 12, 2015

Eventual Consistency and Durability

“A son can bear with equanimity the loss of his father, but the loss of his data may drive him to despair.” (Machiavelli, quoted from memory)

While traditional databases are ACID, with the ‘D’ meaning Durability, NoSQL databases are mainly described by their consistency model, i.e. strong consistency vs. eventual consistency. We’re going to see that durability is not something you can forget about so easily.

Definition of Eventually Consistent

Eventual Consistency was defined by Werner Vogels in [E2]: “the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value.”

It’s one definition among many. Bailis and Ghodsi, for example, mention [E3]: “Informally, it guarantees that, if no additional updates are made to a given data item, all reads to that item will eventually return the same value.” This definition only says that the value will converge to something (say ‘42’). In 1995, Terry used [V5]: “All servers move towards eventual consistency. That is, the Bayou system guarantees that all servers eventually receive all Writes.”

As most implementations available today are inspired by the Dynamo paper [E1] and Vogels’ work, it’s safer to stick to Vogels’ definition: he knew them all, and chose his words carefully. In the CAP series, I explained the importance of this definition.

And finally, this consistency is not the same thing as the consistency in ACID.

Using an eventually consistent database

So if we have an eventually consistent key-value database, we can implement a client that would be doing something like this for example:

ecStorage.put("anId", "someData");
while (ecStorage.get("anId") == null)
    sleep(1); // wait one second between retries

We know that this program will finish: even if the first “gets” may not see our write, eventually they will see it.

Eventually Consistent database - possible implementation

The most usual implementation is the one described, once again, by Werner Vogels in the same paper [E2]:
N = The number of nodes that store replicas of the data.
W = The number of replicas that need to acknowledge the receipt of the update before the update completes.
R = The number of replicas that are contacted when a data object is accessed through a read operation.

A typical value for N is 3: the data is replicated on 3 nodes, synchronously or asynchronously, depending on the value of W.
Values for W and R vary between 1 and 3, depending on what you want in terms of performance and consistency.
If we write with W=1, the initial write goes to a single node. The write is finished, from a client point of view, when a node stores the data. The data is then asynchronously written to the other two nodes.
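This N/W/R behaviour can be made concrete with a toy model. This is a deliberately naive, single-threaded sketch (class and method names invented), ignoring versioning, read repair and hinted handoff:

```python
class Replicas:
    """Toy Dynamo-style store: N nodes, W write acks, R read probes."""

    def __init__(self, n):
        self.nodes = [dict() for _ in range(n)]

    def put(self, key, value, w):
        # Synchronous part of the write: only the first w replicas have the
        # value when the call returns; the others would be filled in
        # asynchronously, later.
        for node in self.nodes[:w]:
            node[key] = value

    def get(self, key, r, start=0):
        # Probe r replicas, starting at an arbitrary node.
        for i in range(r):
            node = self.nodes[(start + i) % len(self.nodes)]
            if key in node:
                return node[key]
        return None

store = Replicas(n=3)
store.put("anId", "someData", w=1)
assert store.get("anId", r=1, start=0) == "someData"  # happened to hit N1
assert store.get("anId", r=1, start=1) is None        # miss: not replicated yet
assert store.get("anId", r=3, start=1) == "someData"  # R=N always finds it
```

With W=1 and R=1, a read can miss the write simply by landing on the wrong node, even before any failure enters the picture.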

Eventual Consistency, failures and dataloss

So, if we try our program on a Dynamo-like key-value database, with W=1 and R=1, what happens?
The simplest scenario, with 3 nodes N1, N2, N3, is:

Client does a put on N1
Client does a get on N1
The program finishes.

Another one would be:
Client does a put on N1
Client does a get on N2: receives null
Client does a get on N1
The program finishes.

Or, also:
Client does a put on N1
Client does a get on N2: receives null
In parallel, the data was replicated from N1 to N2
Client does a get on N2: receives the data
The program finishes.

These scenarios are all fine. What if there is a failure? After all, that’s what we’re here for!
We could have, for example:
Client does a put on N1
N1 crashes after the put but before the replication
Client does a get on N2: receives null
Client does a get on N3: receives null

Will this end someday? Well, if N1 does not come back to life, no: N1 failed before the data was replicated. In other words, you wrote an application relying on an eventually consistent storage, but a Dynamo implementation with W=1 is not an eventually consistent storage. More generally, a storage system that can lose data cannot be called an eventually consistent storage system.
Remember: failures do happen. That’s why you’ve chosen eventual consistency in the first place.
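The failure scenario above can be checked mechanically. A tiny sketch, with plain dicts standing in for the three nodes:

```python
# N=3 replicas; W=1: the write is acknowledged once a single node has it.
replicas = {"N1": {}, "N2": {}, "N3": {}}
replicas["N1"]["anId"] = "someData"   # put returns: the client moves on

# N1 crashes before the asynchronous replication to N2/N3 runs.
del replicas["N1"]

# The client can poll forever: no surviving replica will ever have the value.
assert all("anId" not in node for node in replicas.values())

# With W=2, the same crash is survivable: a second replica already has the
# data and can propagate it, so eventual consistency still holds.
replicas = {"N1": {"anId": "someData"}, "N2": {"anId": "someData"}, "N3": {}}
del replicas["N1"]
assert any("anId" in node for node in replicas.values())
```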

So what?

This shows a strong link between consistency and durability. A consistency model has no notion of data loss. It’s different for databases, hence the need for the ‘D’ in ACID. This is especially important in the ‘NoSQL’ / ‘big data’ worlds, as these systems are deployed on commodity servers that fail more often than your usual high-end server.

Is this a real world issue?

Well, hopefully, if people spend so much time describing consistency models, it’s because they are useful and used in real life. The idea is that the developer knows the consistency model and can rely on it when writing applications. As such, the hidden implication of W=1 should be made explicit when describing the Dynamo model.

This requirement to write synchronously to multiple nodes to get eventual consistency is important. If the application needs to be partition-tolerant, it also means that it must write synchronously to multiple racks, dramatically increasing the write latency compared to a single-node write.
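The latency cost is easy to see: a write with W acknowledgements completes only when the W fastest replicas have answered. The numbers below are invented for illustration:

```python
def write_latency(w, replica_latencies_ms):
    """A write returns once the w fastest replicas have acknowledged it."""
    return sorted(replica_latencies_ms)[w - 1]

# Hypothetical figures: one replica in the local rack, two in another rack.
latencies = [0.2, 1.5, 1.6]   # milliseconds, made up for the example
assert write_latency(1, latencies) == 0.2   # W=1: the local ack is enough
assert write_latency(2, latencies) == 1.5   # W=2: wait for a cross-rack ack
```

With W=1 the slow replicas are off the critical path; as soon as W > 1, the write is as slow as the (W)th-fastest replica, which across racks or datacenters dominates the total latency.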

Between datacenters, synchronous writes become costly. Asynchronous replication solves the performance part, at the cost of losing some data--and losing eventual consistency--if a disaster occurs in one datacenter. Eventual consistency then cannot be used to describe the system as a whole. It’s something that Stonebraker already pointed out in [C10]: “Hence, eventual consistency cannot be guaranteed, because a transaction may be completely lost if a disaster occurs at a local cluster before the transaction has been successfully forwarded elsewhere.”

This post is part of the consistency series.


[E1] G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-Value Store”, Proc. 21st ACM SIGOPS Symp. on Operating Systems Principles, 2007
[E2] Werner Vogels, “Eventually Consistent”, ACM Queue, vol. 6, no. 6, 2008
[E3] Peter Bailis and Ali Ghodsi, “Eventual Consistency Today: Limitations, Extensions, and Beyond”, Communications of the ACM, vol. 56, 2013
[C10] Michael Stonebraker, “Errors in Database Systems, Eventual Consistency, and the CAP Theorem”, ACM blogs, 2010
[V5] Doug Terry et al., “Managing Update Conflicts in Bayou, A Weakly Connected Replicated Storage System”, SOSP 1995

The Consistency Series

“Those are my consistency models, and if you don't like them... well, I have others.” - Groucho Marx, quoted from memory

The CAP theorem series is coming to an end. I have a few posts planned, but they will be mainly about wrapping things up--I hope these posts will be great, though.

Anyway, here starts a new series: the consistency series. This series gets into some practical details of the consistency models and their implementations. You think that the trade-off is only, or mainly, strong consistency vs. eventual consistency? That everything is said with “W+R” vs. “N”? Then read this series.

Settling on the color may not be enough when choosing a car.

Posts, in recommended reading order

The posts published so far are:
  • Eventual Consistency and Durability: shows the link between durability and consistency, and why traditional databases have the ‘D’ in ACID. Tells you what ‘W=1’ implies in the Dynamo model.