Hard-Core Concurrency Considerations

(This is a snapshot of my old weblog. New posts and selected republished essays can be found at raganwald.com.)

Wednesday, May 09, 2007

Kevin Greer responded to What Haskell teaches us about writing Enterprise-scale software with some insightful emails about the pros and cons of using purely functional data structures for managing concurrency in a multi-processor environment. I’ve reproduced them (with his permission) below.

Now, your first reaction might be, “Ah, Reg is not nearly as smart as he thinks he is.”

If you feel like giving me some credit, you can keep in mind that I was not writing the definitive essay on designing concurrent software, just pointing out some interesting overlaps between what I consider to be the most academic programming language around and the most practical requirements in business applications.

But there’s something more important to take from this.

The original post was a response to someone asking whether there was any value to the stuff you read on programming.reddit.com. His query was whether reading about Haskell was practical. My response was, yes it is, and here are some examples of where functional data structures have an analogue in concurrent data structures. My thesis (that’s a two dollar word for “point”) was that many “impractical” things have a lot to teach us about things we will encounter in the pragmatic “real world.”

Java Concurrency in Practice is written by the people who actually designed and implemented the concurrency features in Java 5 and 6. If you are writing Java programs for two, four, eight, or more cores and CPUs (and isn’t that practically all server-side Java development?), owning and reading this book should be the very next step in your professional development.

Of course, most people read the post and thought, “Cool, some neat stuff about concurrency.” Nothing wrong with that. If you value tips on concurrent programming (and I certainly do), you’ll find Kevin’s emails very insightful.

And if you are still wondering whether you should look at “impractical” and “academic” material like Haskell, Erlang, or other things not being promoted by MegloSoft, consider that one of the papers Kevin cites describes a data structure running on a 768 CPU system. Is this in a University lab somewhere? No, it is in a company that promotes itself as an Enterprise-Scale Java company.

You simply can’t assume that everything the industry tells you to learn is everything you need. Or that any one source (Cool! Raganwald has a new post on Lazy Evaluation) has the definitive answer.

You need to commit to life-long learning to be a software developer. Some of that learning is straightforward MSDN Magazine stuff, simple updates to things you already know. And some of that learning is a little further out on the curve.

Without further ado, here is one of the most comprehensive technical replies to a post on my blog I’ve received to date.

Concurrency Considerations

Hi Reg,

I was just reading your article on concurrent collections and have a few comments:

Just because readers do not update the structure doesn’t mean that they don’t need to synchronize. This belief is a common source of concurrency bugs on multi-processor systems.

Unless you synchronize on the root of your data-structure (or declare it as volatile), you can’t be sure that your cache doesn’t have a stale version of the data (which may have been updated by another processor). You don’t need to synchronize for the entire life of the method, as you would by declaring the method synchronized, but you still need to synchronize on the memory access. You hold the lock for a shorter period of time, thus allowing for better concurrency, but you still have to sync.

If you fail to synchronize your memory (or declare it as volatile), then your code will work correctly on a single CPU machine but will fail when you add more CPU’s (it will work on multi-core machines provided that the cores share the same cache). If you have a recursive datastructure (like a tree) then you will need to actually synchronize on each level/node, unless of course you use a purely functional data-structure, in which case, you’ll only need to synchronize on the root.

Given that you need to make a quick sync anyway, this approach is not much better than just using a ReadWrite-lock (it is slightly better because you can start updating before the readers finish reading (not a big deal for get()’s but potentially a big deal for iterators()), but then again updates are more expensive because of the copying).

I don’t think that you should be using Haskell as a model of “Enterprise-scale” anything. “Enterprise-scale software” usually entails “Enterprise-scale hardware” which means, among other things, multiple-CPU’s. The problem is that Haskell purely-functional model doesn’t support multiple-CPU’s (it’s true, check the literature (except for specialized pipelined architectures, but not in the general case)).

The whole processing “state” is essentially one large purely-functional data-structure. The problem of synchronizing your data-structure appears to be eliminated, but it has really just been moved up to the top-level “state” structure (monad), where the problem is actually compounded. This is worse, because not only would you need to lock your structure in order to make an update, but you would in fact need to lock “the whole world”.

Some people will advocate launching one process per CPU and then using message passing to communicate between them. This is just a very inefficient (many orders of magnitude slower) way of pushing the synchronization issue off onto the OS/Network-stack. (Q: What do multi-core systems mean for the future viability of Haskell?)

You forgot to mention the common technique of using “striping” to improve concurrency. Basically, what you do is create multiple sub-datastructures which each contain only a sub-set of the data. You can partition the data based on the hash of the key being stored. You then wrap the whole thing up so that it still has the same interface as the single data-structure.

The advantage of this approach is that when you use a ReadWrite lock you only need to lock a small portion of the data-structure, rather than the whole thing. This allows multiple updates to be performed in parallel. This is how Java’s concurrent collections work. See: ConcurrentHashMap: Java creates 16 sub-structures by default but you can increase the number when even more concurrency is required.

Have a look at Azul’s non-blocking HashMap implementation. They can do 1.2 billion hashMap operations per second (with 1% of 12 million/sec of those being updates) on a 768 cpu machine (the standard Java ConcurrentHashMap still does half a billion ops/sec which isn’t bad either) . This shows how scalable non-functional methods can be.

I’ve never read of any Haskell or other purely-functional implementation being able to scale to anywhere near these numbers.

There’s another entire world of concurrency control, the world of independent actors communicating with fault-tolerant protocols. The world of Erlang. You can pre-order the most hotly anticipated book on the subject, Programming Erlang: Software for a Concurrent World, the definitive reference from the language's creator, Joe Armstrong.

Summary: Using a purely-functional data-structure does make it easier to support multiple readers, but you still need to sync briefly at the “top” of the structure. Striping has the added advantage of also supporting multiple-writers as well, and in practice, this is a much bigger win. Haskell is inherently limited to a single CPU, which doesn’t match modern hardware; especially of the “enterprise” class. Shared-memory models of concurrency have demonstrated much better performance than purely-functional models.

Best Regards,

Kevin Greer

p.s. I've actually implemented a b-tree structure for very large data-sets and chose to use a purely-functional data-structure myself. My reason for doing so was that I have some very long-running read operations (~8 hours) and I obviously can't afford a ReadWrite lock that's going to starve writer threads for that long.

Another nice advantage of using purely-functional data-structures is that they make it easy to implement temporal-databases that let you perform time-dependent queries.

I just wanted to point out that if all you have is quick-reads then purely-functional is no different than a simple ReadWrite lock and that a super-pure implementation, as with Haskell, doesn't scale to multiple-CPU's or to highly concurrent updates. However, it can be used to good effect in combination with striping or other techniques.

(A tangential note: Java's GC is getting so good in recent releases that P-F data-structures are becoming much more efficient (given that P-F generates more garbage).)

p.p.s. One more advantage of functional data-structures (the largest advantage for me actually):

They map well to log(or journal)-based storage. Functional data-structures never update old data, but instead, just create new versions. If your data-structure uses disk-based storage then this means that you never need to go back and overwrite your file; you only need to append data to the end of the file. This has two advantages:

This works well with the physical characteristics of hard-disks. They have incredibly high transfer rates (10’s of Megs/sec) but very slow seek times (~ 200 seeks/sec if 5ms access time). If you are adding say 1k objects to a data-structure which requires 2 updates on average to a file then you’re looking at about 100 updates per second. If on the other-hand you write all of these updates to the end of the file then you’re looking at say 20,000 updates per second!

You can’t corrupt old data with interrupted writes. The old data is always still there. The only place that a corruption occur is at the end of the file, in which case you just scan backwards until you find the previous good head of your data-structure.

This is fantastic stuff to share, thanks, Kevin!

What other tips can readers contribute for people building highly concurrent software (besides the frequent use of JProbe Threadalyzer, of course)? What online resources (papers, articles, books) do readers recommend for the professional developer?

Labels: java, passion, popular

¶ 2:17 PM

Comments on “Hard-Core Concurrency Considerations”:

That was indeed a very interesting email, but I'm a bit confused regarding his assertion about Haskell. A quick Google turned up a paper at http://research.microsoft.com/~tharris/papers/2005-haskell.pdf that describes a technique for lock-free evaluation of shared thunks that was developed to let GHC run on multiple cores. They quote an average time cost of going from single core to multicore of 6%, much less than the potential gains for parallel evaluation.

Of course, this doesn't solve the problem (introduced by laziness) of trying to preemptively evaluate thunks that might be useful in the future. This is what is needed in order to get real speedup from multicore, but at least the common cases should be caught by the usual strictness analyses.

# posted by

Omega Prime : 4:20 PM

"Unless you synchronize on the root of your data-structure (or declare it as volatile), you can’t be sure that your cache doesn’t have a stale version of the data (which may have been updated by another processor)."

Uh, am I missing something here? This is why cache coherence protocols exist. Or am is this on some sort of weird machine that doesn't make guarantees about your view of memory?

# posted by

Anonymous : 5:41 PM

Huh? I think he's about 5 years out of date on Haskell concurrency research.

The Haskell community is pursuing a two-prong approach to concurrency.

1. Software Transactional Memory replaces the traditional "locks and semaphores" approach. STM is a good fit for Haskell, and the general technique scales to a 100+ cores. (E-mail me and I'll pull together a bibliography.)

2. Nested Data Parallelism, combined with aggressive fusion, provides a lovely programming model for "embarrassingly parallel" tasks. The Haskell implementation is still a work-in-progress, but it looks like it will be sweet.

Haskell is convenient for this type of research, because it isn't a purely-functional language--instead, it has a purely-functional core which you can supplement with state mechanisms of your choice.

And Haskell's not the only almost-purely-functional language with great scalability results. Erlang has a severely-restricted notion of state, and it's legendary for its scalability.

I'd love to know what results your correspondent is hinting at, because I'm frankly a bit confused.

# posted by

emk : 6:37 PM

Argh. The phrase "great scalability results" in the previous comment was intended to refer to Erlang, not Haskell. Haskell sees little use on massively concurrent hardware, but not, I would argue, because it has poorly-chosen concurrency primitives. (One real obstacle to high-performance Haskell is lazy evaluation.)

Anyway, here's an STM scalability paper:

Harris and Fraser, Language Support for Lightweight Transactions.

This paper benchmarks Java's ConcurrentHashMap against a simple Java STM hashmap. They're running on a 106-processor ccNUMA machine, utilizing slightly less than half the cores. And the STM implementation does just fine, even though it is far simpler than ConcurrentHashMap.

I also have a draft paper by Robert Ennals which reaches 90 cores (with high contention) on the same machine, also with good results. Google doesn't seem to have a working link.

So I would argue that Haskell's concurrency problems lie elsewhere, not with the choice of STM and NDP as concurrency models.

# posted by

emk : 7:15 PM

@anonymous

Cache coherence protocols only apply to shared memory, but there is usually a level of cache that isn't shared, which is closer to each individual processor. A strong (totally coherent) memory model have slower performance. Stale data usually needs to be considered, because the memory models of modern systems are weak. I've even seen problems with stale data in managed .NET code running on a single processor, but with Hyperthreading. A memory barrier resolved the issue.

# posted by

N432 : 8:34 PM

@"Some people will advocate launching one process per CPU and then using message passing to communicate between them. This is just a very inefficient (many orders of magnitude slower) way of pushing the synchronization issue off onto the OS/Network-stack."

I'm like "some people", except that I advocate using threads (not processes), and storing the messages (often just function pointers) in a simple queue. This is more efficient than using interprocess communication.

I don't just advocate message passing for its performance, it is also easy for humans to conceptualize. When threads share state inside locks, they are increasing the scope of that state to multiple threads. Personally, I have difficulty understanding a variable when its scope surpasses a function, let alone a thread.

# posted by

N432 : 8:56 PM

Kevin Greer basically described the disk storage design of CouchDb, as well my reasoning for the design.

It's also similar to how PostGreSQL is designed. I had a chance to talk with Dr. Stonebraker about it at his office in MIT.

# posted by

Damien : 9:00 PM

@n432

"I've even seen problems with stale data in managed .NET code running on a single processor, but with Hyperthreading. A memory barrier resolved the issue."

If it's occuring on a single core after turning on Hyperthreading, then "stale data" certainly isn't the problem. Logical processors share the cache (look at the "Caching and SMT" part).

Moreover, if you're using a cache coherency scheme, smaller caches are usually subsets of larger caches. That, and using write-back caching means that the large caches can handle coherency without much additional bus traffic, and make assurances about the contents of more smaller caches.

"Stale data usually needs to be considered, because the memory models of modern systems are weak."

What are you talking about? Intel, AMD, and co. go to incredible lengths to give a unified view of memory. You may not be able to do complex operations atomically (use a CAS if you want a guarantee), but the view of memory is unified. Even big NUMA machines go to incredible lengths to ensure coherency, even at the expense of the occasional tremendously painful memory access latency.

# posted by

Anonymous : 3:03 AM

I think that Kevin's email, while valuable, misses a very important point relating to the future of concurrent programming. The challenge is not simply to be able to "do concurrency" -- we already know that all the things Kevin mentioned are possible in mainstream languages such as Java and C#. The challenge is to make it less fricking difficult and error prone!

What Azul and others have done is amazing, but in the future we can't delegate all our programming to the geniuses at Azul. In imperative languages with a simple lock-based approach to concurrency, it is EXTREMELY hard even for very experienced developers to write concurrent code that is free of deadlocks, race conditions, priority inversion and starvation. Furthermore all those bugs tend to appear ONLY once the code has gone into production, and it is fantastically difficult to find and fix those bugs.

We really need to find structures that let programmers reason intuitively about concurrent programming, and in which it is hard to do the wrong thing rather than hard to do the wrong thing. Haskell offers two great solutions which fulfill this need: Software Transactional Memory (STM) and Nested Data Parallelism (NDP). Interestingly both of these techniques are much harder to implement in an imperative language than in a lazy, purely functional language like Haskell. In a direct contradiction to Kevin's assertions, I believe that Haskell is now one of the best languages for enterprise scalability (along with Erlang).

Incidentally if any of your readers live near London (UK) and would like to attend a talk by Simon Peyton Jones, who is working on implementing both STM and NDP in Haskell, they should come to the inaugural meeting of the London Haskell User Group. More details at http://www.londonhug.net/

Neil

# posted by

Neil Bartlett : 9:10 AM

I'll respond to multiple posts at once.

Omega Prime:

Yes, there has been some recent research on adding shared-memory multi-processor parallelism to Haskell but this is still in the early stages. The authors of the paper that you mention cite their work as “preliminary but promising” and it was only on a two CPU machine. It is based on Software Transactional Memory (STM) which is still very much an area of open research and it’s not yet clear when, if ever, this will become a viable solution. This may become a viable solution in the future, but it certainly isn’t yet. (One disadvantage of STM is that it doesn’t scale well when you have a lot of concurrent writes. This isn’t a problem if you can partition your threads to work in different areas but my original email to Reg was about what to do when you do have a need for concurrent updates.

There was also work done about a decade ago on concurrent Haskell based on MPP. This is similar to Erlang and has the disadvantage of not taking advantage of shared-memory on multi-core machines. It would be hard to build a system capable of handling the > 1 billion msgs/sec that would be required to match Azul’s concurrent hashmap.

See: http://citeseer.ist.psu.edu/246155.html

Anonymous:

“Uh, am I missing something here? This is why cache coherence protocols exist. Or am is this on some sort of weird machine that doesn't make guarantees about your view of memory?”

Yes, you are missing something: registers. While memory may exhibit cache coherence, registers don’t. For obvious performance reasons, Java use registers instead of memory whenever it can. When it does this, you lose cache consistency. This is why Java has the ‘volatile’ keyword to prevent certain variables from being cached in registers.

As I mentioned in my original email, this very common misconception is a great source of concurrency bugs.

You can find more information about Java’s memory model here:

http://www.cs.umd.edu/~pugh/java/memoryModel/

emk:

“embarrassingly parallel” tasks (like computer graphics, finite element analysis, etc.), have never been a problem given that the individual threads can work relatively independently (usually, with something like channels/pipelines between them). As you say the “Haskell implementation is still a work-in-progress”. These may very well turn out to be great future solutions, but certainly not “Enterprise-Scale” solutions for *today*.

Ironically, if you look at where STM has been shown to scale, it’s with a procedural language implementing non-functional data-structures. This doesn’t mean that you’ll be able to transfer this success to Haskell given that Haskell will need to obviously use purely-functional data-structures (PFDS). STM’s rely on updates being independent most of the time but PFDS guarantee that the only thing that you update is your root/world/top-data-structure. STM’s may turn out to be much more successful with languages like Java than with languages like Haskell.

My point was really, that yes there are things to be learned from functional data-structures, Haskell has yet to demonstrate its ability to effectively scale to the level required for enterprise-scale systems (my “opinion” is that it never will, but the “fact” is, that it hasn’t yet).

N432:

I’m also a big advocate of shared-memory (multi-threaded) message passing solutions for concurrency. In Java this can be done with Channels. There are a lot of other techniques that I use besides what I mentioned in the original post but the original conversation was about functional data-structures for concurrency.

Neil Bartlett:

“I think that Kevin's email, while valuable, misses a very important point relating to the future of concurrent programming.”

I’m writing and deploying enterprise-scale solutions today, and for that, I need solutions that work today, not in the future. I keep an eye on what’s being worked on (STM, shared-nothing) and what has been worked on (Actors, functional programming) for ideas on how to do things better today, and I’ve had a lot of success at incorporating these ideas into Java-based solutions today.

“The challenge is not simply to be able to "do concurrency" -- we already know that all the things Kevin mentioned are possible in mainstream languages such as Java and C#. The challenge is to make it less fricking difficult and error prone!”

We don’t actually have many problems with concurrency now that we have a suitable set of higher-level abstractions:
- FoldReduce (not to be confused with MapReduce)
- ThreadLocal Storage
- Thread Pools
- Futures
- Message Passing, Asynchronous Channels, Pipelines, Reactive Programming
- Parallel Functors
- Scheduled Jobs
- Concurrent Collections
- Socket Components

We (the developers at my company) very rarely need to work with Threads, mutexes, or synchronized blocks directly anymore. Yes, somebody needs to build these abstractions, just as somebody needs to build the compiler, but we’re at the point now that most developers shouldn’t have to worry about the low-level concurrency constructs.

“I believe that Haskell is now one of the best languages for enterprise scalability.”

What do you base this on? How many such systems have you deployed or seen deployed?

# posted by

Kevin Greer : 11:40 AM

Now that the subject has been hit, I've always wondered: are there any arguments against a (lot) of striping ? I'm currently in the process of implementing a cache which is going to be used by many many reader threads on a 8-core machine, and was considering to implement this cache with a 256-way striping implementation (as in, using the last byte of the identification number to determine which container to look into).

Now, I am aware of the extra memory footprint this will bring along, but this is irrelevant compared to the amount of elements to be stored inside the cache (around 5 million).

Are there any arguments against using a 256-way striping solution, compared to, for example, an 8-way solution ?

# posted by

Anonymous : 1:44 PM

are there any arguments against a (lot) of striping?

I am not speaking from hard-core experience here, but based on some modeling that I did when designing a distributed system, and a lot of speculative day-dreaming, you might want to think about (or better still, measure) the balance between operations that permute and operations that don't permute collections.

In essence, when you have a collection striped or distributed (whether across threads or machines a'la MapReduce), some operations preserve the distribution of data, and some permute it.

So if you have a Set, and you want to perform the Union or Intersection of that Set with another Set, this operation probably does not perform any permutation: everything in the result is in the same "stripe" as the portion of the input needed to calculate it.

But consider two Sets of numbers and you perform some operation like addition. Now the results might be in different "stripes." This creates a permutation.

That might be expensive, because it involves communication and synchronization between threads, processes, or systems, where operations within a thread are very cheap.

I am probably using the wrong word when I say "permutation," but I latched upon it when thinking about sorting large data sets that are distributed across systems.

Instead of measuring the number of operations, I needed two measurements: operations and messages. The messages are invariably much more expensive than the operations.

JM2C, I did not do extensive research.

# posted by

Reginald Braithwaite : 1:56 PM

"are there any arguments against a (lot) of striping?"

Heavily striped objects are slower to create and slower to garbage-collect. For short-lived, lightly-concurrent objects you may be better off at actually reducing the concurrency factor. For long-lived objects however I don't see any problem. (I saw in one of the Azul benchmarks that they set the concurrency level to 4096.)

The best way to tell what level of striping is best for your application is to benchmark it with various settings.

# posted by

Kevin Greer : 2:15 PM

Kevin: Thanks for the fascinating response! I agree that for near-term work on 100+ cores, Java's an excellent choice, and that Haskell isn't ready for this kind of application.

And you're also right about the limits of purely functional programming.

But Haskell's STM isn't purely functional. In fact, it supports mutable variables, imperative code, and all the standard techniques found in ordinary languages. (The syntax is shamefully ugly, but everything's there.)

The unusual thing about Haskell's STM support is that "mutable state" is (conceptually) all part of a library. If you don't like transactional state, you can replace it with something else--perhaps a protocol for updating lock-free variables. You design it (and hack it into GHC), and the type-checker enforces it.

So Haskell concurrency research is all over the map, with a regular zoo of different approaches. This includes work on hardware description languages, and there's a few people starting to poke at GPUs (much more work in the latter area is needed, perhaps once NDP is finished).

Any long-term strategy for concurrency will need to deal with these kinds of issues--a 500+ CPU ccNUMA machine is almost civilized compared to the 8 special-purpose cores on a Cell, or the headaches of an FPGA. And these are very hard architectures to target with the current Java toolset.

So from a research perspective, Haskell is a great place to study concurrency, because it gives you fine-grained control over what kind of state you want to support.

# posted by

emk : 2:29 PM

kevin greer:

Thanks for the link to the Java memory model info. It cleared everything up. In context, I hadn't realized that that bit of commentary was specific to weak memory models.

# posted by

Anonymous : 1:48 PM

There is forkOS that is smp enabled in ghc.

# posted by

george : 8:02 PM

<< Home