tag:blogger.com,1999:blog-4490444256563865262.comments2023-12-26T06:47:22.192+01:00This long runNicolas Liochonhttp://www.blogger.com/profile/07943925485349697034noreply@blogger.comBlogger15125tag:blogger.com,1999:blog-4490444256563865262.post-67934061559346970192015-08-10T12:32:06.152+02:002015-08-10T12:32:06.152+02:00Thanks for a very informative article. Given that ...Thanks for a very informative article. Given that the definition and context of ACID relate to relational databases, should we in fact be more concerned about BASE versus ACID in terms of equivalence?Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-29764305016374897652015-07-10T12:30:52.353+02:002015-07-10T12:30:52.353+02:00I updated the post to make clear that the partitio...I updated the post to make clear that the partitions mentioned here are network partitions. That was clearly confusing. Also, there was no discussion around CP vs. AP. The discussion is around CA/CP. I changed a sentence that was unclear. And I added a few details here and there.<br /><br />CP and CA overlap because CA does not mean that partitions are impossible. Brewer: "It is best to think about this probabilistically: choosing CA should mean that the probability of a partition is far less than that of other systemic failures."<br /><br />For the data itself, we need the probability of a partition. The paper pointed to in the response does a pretty good job of showing that networks are not reliable, but not of estimating probabilities.<br /><br />And the same paper confirms you can't generalize anyway: "On the other hand, some networks really are reliable. 
Engineers at major financial firms have anecdotally reported that despite putting serious effort into designing systems that gracefully tolerate partitions, their networks rarely, if ever, exhibit partition behavior."<br /><br />I stand by my conclusion: “the whole space is useful.”<br />Nicolas Liochonhttps://www.blogger.com/profile/07943925485349697034noreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-59022637416056250592015-07-08T23:34:20.870+02:002015-07-08T23:34:20.870+02:00Curious as to what you think of this response: htt...Curious as to what you think of this response: https://aphyr.com/posts/325-comments-on-you-do-it-tooAnonymousnoreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-91841044018373796682015-06-22T16:33:37.661+02:002015-06-22T16:33:37.661+02:00Hi Pawel,
For the "why use 2PC at all?"...Hi Pawel,<br /><br />For the "why use 2PC at all?" I don't have a better explanation than the one given by Mark Little ("this additional overhead is not something which the majority of environments, users, use cases etc. are prepared to accept and prefer to use 2PC"). He's an expert (and his book, 'Java Transaction Processing', is really great).<br /><br />I can just add that these protocols are very complex to implement, so people tend to be very conservative. A battle-proven implementation of an average protocol is safer than a fresh new implementation of a theoretically better protocol. I'm not sure that the decision is always very rational.<br /><br />For 3PC from a CAP point of view, I'm not sure. Henry Robinson shows a scenario leading to an inconsistency (http://the-paper-trail.org/blog/consensus-protocols-three-phase-commit/). If he is right (and he is usually right), 3PC is CA... It's a very good question anyway: someone using 3PC must know whether it can lead to inconsistency or blocking scenarios.<br />Nicolas Liochonhttps://www.blogger.com/profile/07943925485349697034noreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-90488254421268856802015-06-16T23:30:38.382+02:002015-06-16T23:30:38.382+02:00And to be clear, I have read the link you've p...And to be clear, I have read the link you've posted above (http://planet.jboss.org/post/2pc_or_3pc). I'm fully aware of 3PC/E3PC overheads; however, I'm curious about the distributed aspects of using/not using those algorithms instead of 2PC.Paweł J.noreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-62492107379349143412015-06-16T23:18:02.144+02:002015-06-16T23:18:02.144+02:00Hi Nicolas!
I have a question regarding 2PC. Sinc...Hi Nicolas!<br /><br />I have a question regarding 2PC. Since in the case of failure your options (both clearly flawed) are:<br />a) blocking all resources (as historically 2PC was defined) or<br />b) heuristically making a decision,<br />why use 2PC at all? There is a non-blocking alternative, 3PC, or an even better solution which additionally is partition tolerant: Enhanced 3PC. Why not just use one of them? I wish you had considered them in your article - it would be another possibility to strengthen our "CAP categorizing" skills :PPaweł J.noreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-87527745120458766152015-05-30T15:11:21.486+02:002015-05-30T15:11:21.486+02:00Hi Pawel,
Thanks for the nice feedback.
To me, us...Hi Pawel,<br /><br />Thanks for the nice feedback.<br />To me, using ACID as a binary yes/no is a little bit like CAP categories: it gives a taste of what the database authors are aiming at but not much more. <br />Distributed databases with ACID do exist: SQL + 2PC is exactly this, but it comes with so many drawbacks (no real tolerance to partitions, limited performance, ...) that it cannot be a general solution. And all distributed databases have chosen different drawbacks so far.<br /><br />As such, if you want to select or use a distributed database you can't use ACID as a boolean value, but using ACID (and CAP) to study the database behavior helps a lot to identify the trade-offs actually made. For example:<br /><br />Atomic:<br /> - many (Bigtable clones) are atomic by row, and rows have to fit on a single node.<br /> - Some (VoltDB) do more, but then the actual cost of cross-node operations can be quite high.<br /><br />Consistent:<br /> - Some constraints are easy to implement (checking data types, for example)<br /> - Foreign key checks are expensive on big data systems. Some try to keep them outside of the DB. There is a paper by Peter Bailis (Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity), with some 'fun' (but expected) findings. (It seems it's not available anymore. As the conf takes place next week, it should be back soon.)<br /><br />Isolation:<br /> - Among all the default levels + snapshot isolation, which one is efficiently implemented by the DB? Is it implemented with locks all over the place, forfeiting concurrency?<br /><br />Durability:<br /> - The wildest one. A lot of NoSQL databases forfeit durability to get better performance: they write on a single node instead of 2 or more, flush the buffer asynchronously, or keep the data in the client buffer.<br /><br />So the boolean ACID & CAP are very limited, but taking each letter individually and asking for each one "What does it mean exactly?" 
and "what are the drawbacks?" is quite a good way to understand a distributed system.<br />Nicolas Liochonhttps://www.blogger.com/profile/07943925485349697034noreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-90030017750281198502015-05-29T23:41:58.891+02:002015-05-29T23:41:58.891+02:00Hi Nicolas, thanks for this really good article. I...Hi Nicolas, thanks for this really good article. I personally think that because of all those differences between the terms (which you've just described), we shouldn't consider ACID in any way when discussing NoSQL. ACID was created back in the day when systems were hosted on single machines, so their designers didn't even think about distributed processing, and it only confuses people nowadays. I have one question, however (since, contrary to my opinion, ACID is sometimes used when describing NoSQL): do you think that it is possible for NoSQL databases to simultaneously guarantee all of ACID's properties?Paweł J.noreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-34510049377578728732015-05-26T20:03:05.120+02:002015-05-26T20:03:05.120+02:00I'm not sure I got your point.
> If N2 has ...I'm not sure I got your point.<br />> If N2 has died [] then it will fail to read and satisfy step 3.<br />If N2 is dead there is no risk of being inconsistent: N2 cannot do anything. On a partition you need to choose.<br /><br />> More specifically, this establishes that nodes do fail in the system, yet availability can be maintained.<br />The way the proof and the theorem are written is basically "we're not interested in failed nodes: failed nodes and nodes that cannot receive requests are excluded from the scope of our analysis." But yes, in real life we have to manage them (and CAP does not help us here).<br />Nicolas Liochonhttps://www.blogger.com/profile/07943925485349697034noreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-61366831704531930752015-05-25T23:59:54.884+02:002015-05-25T23:59:54.884+02:00> The definition of availability requires “ever...> The definition of availability requires “every request received by a non-failing node in the system must result in a response.”<br /><br />More specifically, this establishes that nodes do fail in the system, yet availability can be maintained. It says you must deal with node failure, and the primary failure mode is failure-to-respond. The specific failure behind failure-to-respond is more of an engineering concern than a theoretical one.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-57220432384485961792015-05-25T23:52:47.143+02:002015-05-25T23:52:47.143+02:00> The definition of availability requires “eve...<br />> The definition of availability requires “every request received by a non-failing node in the system must result in a response.” So, a node failure is, by definition, outside the scope of the CAP theorem.<br /><br />Well, it's potentially outside the scope of the definition of availability. But it's not clear here that non-failing refers specifically to hardware failure, VM failure, process failure, network failure, etc. 
It reads to me like the author is carving out some leeway for sensibility in saying a node must respond. That *of course* a failing (in some sense) node can't respond.<br /><br />> This proof relies on having code running on both sides of the partition. If N2 is dead instead of being partitioned, the scenario cannot exist.<br /><br />That's not how I read it. Step 3 merely requires that N2 fail to read. If N2 has died, lost its network connection, or paused for 1 year, then it will fail to read and satisfy step 3.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-90368594042513278162015-04-29T16:56:39.840+02:002015-04-29T16:56:39.840+02:00Thanks for commenting, Charles. I will have a look...Thanks for commenting, Charles. I will have a look at the HN thread as well.<br /><br />> First [...] you don't actually define the semantics<br />This part looks only at a partition between the web server and the data store.<br />Partitions within the data store are looked at here: http://blog.thislongrun.com/2015/04/cap-availability-high-availability-and_16.html<br /><br />> Second [...] In that case, why use 2PC at all?<br />Heuristic decisions have been in the XA standard since its first version (1991). YMMV, but they are very often used (to say the least) in 2PC production systems. See for example how Mark Little describes them: http://planet.jboss.org/post/2pc_or_3pc. They are not really presented as an optional thing.<br />It shows traditional databases are not that 'CP at all cost' when it comes to managing failure.<br /><br />> Your conclusion is correct [...] the resulting algorithm is not C or A<br />Yeah... I see 2PC as not partition-tolerant, as a partition breaks ACID atomicity. Once partition intolerance is accepted, CA fits well: 2PC is consistent & available until there is a partition. 
Saying '2PC is not consistent and not available but is partition tolerant' is technically not false, but it's a much less accurate description.<br /><br /><br />> Third [...] datacenters full of commodity servers [...]<br />> redundant NICs and cables, located next to each other <br />> [...] It isn't hard to believe that you could build a system, like that,<br />> in which you could add algorithms that were CA<br /><br />I just totally agree with you here. CAP as a categorization tool is used for all types of distributed systems, but there is a huge difference between an application built on commodity hardware running in a multi-DC config and a 2-node system running on redundant hardware in a single rack.<br />Typically, 2PC is historically used on very specific applications: a few nodes, a limited number of operations between the nodes, expensive redundant hardware all over the place, limited scaling-out needs (if any). Not your typical big data system.<br />Nicolas Liochonhttps://www.blogger.com/profile/07943925485349697034noreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-59652015464287192492015-04-25T21:13:35.600+02:002015-04-25T21:13:35.600+02:00Cross posting my notes from HN in case you aren...Cross posting my notes from HN in case you aren't reading comments there:<br /><br />First, in the "The same application with an Eventually Consistent store" section, you don't actually define the semantics of the system, so there is no way to say how it relates to CAP. Eventually consistent stores, like the original Dynamo, are designed to be AP systems, which is why they have things like read-repair. The requirement for "A" is that requests actually received by non-failing nodes are responded to, so as long as any replica is up, you will get a response, it's just that it may not be the consistent response.<br /><br />Second, your argument about two-phase commit is combining two different arguments. 
First, the standard 2PC algorithm does not allow for heuristic commits. That algorithm is a CP algorithm (it is, in fact, referred to as the "unavailable protocol" by Pat Helland due to its inability to make progress in any failure condition). If you add in heuristic commits, then it becomes a disaster! I can imagine some situations where such a protocol would be useful, but not if you care at all about your data consistency. In that case, why use 2PC at all? Your conclusion is correct for the algorithm with heuristics: the resulting algorithm is not C or A.<br /><br />Third, your confusion about CA is that you are trying to apply it to systems, like datacenters full of commodity servers and switches, that don't really fit it. Imagine, instead, a system of two computers connected by multiple, redundant NICs and cables, located next to each other and communicating by sending messages. It isn't hard to believe that you could build a system, like that, in which you could add algorithms that were CA: in other words, if someone cuts the cable, then all bets are off, but as long as those cables remain uncut, the algorithm guarantees consistency and availability. A better example is cache coherency protocols in modern processors, which are incredibly complex distributed algorithms working on CPUs that communicate over tightly integrated channels. Cache coherency protocols need to be CA, of course! If you somehow managed to sever the communication links without destroying the motherboard, you could break the algorithm's assumptions, but that wouldn't make it any less of a CA algorithm.Charles Gordonhttps://www.blogger.com/profile/18220920684633877544noreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-63564316395593729562015-03-31T10:03:03.936+02:002015-03-31T10:03:03.936+02:00Thanks for the feedback.
> The intuition you&#...Thanks for the feedback.<br /><br />> The intuition you've quoted is right and<br />> I'm not quite sure why you think it's wrong.<br />Because the CAP proof does not hold if the node is dead instead of partitioned :-). A part of this post is just about "applying the right theorem to the right problem". One should not use CAP when reasoning about dead nodes.<br /><br />> That's true *in this failure mode*, but the choice<br />> of “Availability” in the CAP theorem is much stronger<br />I agree. Here I'm just speaking about node failures. If you want to support both node failures and network partitions you have some extra work. This post is not about saying CAP is wrong. It's about saying a node failure is not a partition in CAP.<br /><br /><br />> Now let the {N}-{M} nodes die, and the {M} nodes<br />> turn back on. What happens to your system?<br />There are many possible strategies here. A simple one is to consider a node that turns back on as a plain new node. That's what Bigtable/HBase do.<br /><br />> But again down versus partitioned is indistinguishable <br />> to any monitoring nodes you have that <br />> makes a failure and a partition indistinguishable<br />*Some* failures are indistinguishable from partitions. Many failures are actually distinguishable. In many CP systems, if your monitoring system makes the distinction, your 99th percentile MTTR will be improved by a huge factor, just because you will start the recovery sooner instead of relying on a timeout. <br /><br />> even if partition tolerance and failure are not equivalent,<br />> you still have to design your system (in CAP terms)<br />> as if they are,<br />If partition tolerance and failure are not equivalent:<br />1) On the theoretical side, it means you can build a CP system that will handle node failures but not network partitions. I don't say you should do that. I say you can, and, if your cluster is 20 nodes max on a LAN, it may be a reasonable choice. 
<br />2) You can optimize your system to manage the partition case differently from the node failure case. Typically, any lock or resource held by a dead node can be recovered immediately. If it's a partition, then deciding what to do is more complex, with more trade-offs.<br /><br />It's something I plan to write about in another post, but CAP hides the importance of failure detectors/oracles. From a consistency point of view, you can handle node failures as partitions; but from a latency point of view you need to make the distinction.<br />Nicolas Liochonhttps://www.blogger.com/profile/07943925485349697034noreply@blogger.comtag:blogger.com,1999:blog-4490444256563865262.post-48716417654952024602015-03-31T00:05:11.071+02:002015-03-31T00:05:11.071+02:00http://thislongrun.blogspot.com/2015/03/dead-nodes...http://thislongrun.blogspot.com/2015/03/dead-nodes-dont-bite.html<br /><br />The intuition you've quoted is right and I'm not quite sure why you think it's wrong.<br /><br />In particular:<br /><br />>“Now let us turn to single node failures. [...] You simply failover to a replica in<br />> a transactionally consistent way. Notably, at least Tandem and Vertica have<br />> been doing exactly this for years.”<br />>There is nothing more to add. There are real-world systems that are both<br />>available and consistent when there is a node failure.<br /><br />That's true *in this failure mode*, but the choice of “Availability” in the CAP theorem is much stronger: being available requires a response as long as you have *any* working nodes in the system. 
The fact that some systems have consistent replica pairs that they can fail over between does not make them both Consistent and Available in the CAP sense — it just makes them Consistent but with more uptime than the naive system.<br /><br />Looking at [1], which I assume you’re drawing that quote from, it’s clear that Stonebraker's main point is not that CAP is wrong, but that it encourages the wrong sort of engineering tradeoffs: yes, you might need to choose between C and A eventually, but your systems should be designed to delay the decision as long as possible — and in many applications, users won’t be able to tell the difference. But that doesn’t mean partition-tolerance is not equivalent to fault-tolerance.<br /><br />In particular, Stonebraker assumes that any double failures simply won’t happen, so you shouldn’t worry about them in your system design. As a developer on a CP system (Ceph) with users who report issues to a mailing list, I am sad to report that for systems at scale he is wrong on this assumption. :(<br /><br />> In other words, in a strongly consistent system, the intuition is right, or, more<br />> precisely, actions are taken to make it right. However, in an AP system the<br />> intuition is just wrong."<br />In an AP system you are tolerating a partition, by choosing to *not care* whether the nodes are partitioned or not. But a partition and a node failure are still indistinguishable to any monitoring nodes you have.<br />In a CP system you are choosing to be consistent and partition-tolerant, by guaranteeing that one side will win. 
But again down versus partitioned is indistinguishable to any monitoring nodes you have.<br />And that makes a failure and a partition indistinguishable from anybody else’s point of view.<br /><br />Indeed, you circle back to this in your penultimate section, so I’m not sure why your lede and conclusion say otherwise.<br /><br /> Speaking more generally, even if partition tolerance and failure are not equivalent, you still have to design your system (in CAP terms) as if they are, choosing to give up either C or A:<br /><br />Let’s say you did try and design a “CA” system. Let it consist of a set {N} of nodes. Make some writes {W}. Let a set {M} of nodes go down. Make some more writes {X} to the remaining {N}-{M} nodes.<br /><br />Now let the {N}-{M} nodes die, and the {M} nodes turn back on. What happens to your system? Do the {M} functioning nodes return data based only on {W}, or do they refuse service? Does this simply count as every node being failed until one of {N}-{M} is back?<br />How do those answers differ from applying the same scenario as a partition?<br /><br /><br />[1]: http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errorsGreghttps://www.blogger.com/profile/02369274872329437605noreply@blogger.com
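The {N}/{W}/{M}/{X} thought experiment above can be made concrete with a small simulation. This is only an illustrative sketch (the `Node` class and names are hypothetical, not any real system): writes {W} reach every node, the {M} nodes miss the writes {X} while down, and once {N}-{M} die, the revived {M} nodes can either serve stale data (giving up C) or refuse service (giving up A).

```python
# Toy sketch of the thought experiment above. Hypothetical names throughout:
# replicate writes {W} to all of {N}, take {M} down, apply {X} to the rest,
# then swap which nodes are alive and see what the revived {M} can offer.

class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.log = []                  # writes this node has seen, in order

    def apply(self, write):
        if self.alive:                 # down nodes silently miss the write
            self.log.append(write)

def replicate(nodes, writes):
    for w in writes:
        for n in nodes:
            n.apply(w)

nodes = [Node(f"n{i}") for i in range(5)]   # the set {N}
replicate(nodes, ["W1", "W2"])              # writes {W}: every node has them

M = nodes[:2]                               # the set {M} goes down
for n in M:
    n.alive = False
replicate(nodes, ["X1"])                    # writes {X}: only {N}-{M} see them

for n in nodes:                             # now {N}-{M} die and {M} revives
    n.alive = not n.alive

# Every copy of {X} is on a dead node; the revived {M} nodes hold only {W}.
survivor = M[0]
print(survivor.log)                         # ['W1', 'W2'] -- stale
# Serving this stale data gives up C; refusing to answer until a node
# holding {X} comes back gives up A. There is no third option, which is
# exactly the point of the scenario.
```

The same mechanics apply whether {M} was down or merely partitioned away, which is why, from the revived nodes' perspective, the two cases force the same C-versus-A choice.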