Planet


Cache Line Sizes and Concurrency

We've been looking at high concurrency level
issues with Drizzle and MySQL. Jay pointed me to this
article
on the concurrency issues due to shared cache lines and decided
to run some of my own tests. The results were dramatic, and anyone who
is writing multi-threaded code needs to be aware of current CPU cache
line sizes and how to optimize around them.

I ran my tests on two 16-core Intel machines, one with a 64 byte cache
line, and one with 128 byte cache line. First off, how did I find these
values?

one:~$ cat /proc/cpuinfo | grep cache_alignment
cache_alignment : 64
...

two:~$ cat /proc/cpuinfo | grep cache_alignment
cache_alignment : 128
...

You will see one line for each CPU. If you are not familiar with
/proc/cpuinfo, take a closer look at the full output. It's a nice quick
reference of other things like L2 cache sizes and CPU speed. As you can
see, machine one has a 64 byte cache size, and machine
two has a 128 byte cache size. Next, I wrote the following
C program to test concurrency:

cache_line.c

This program creates a global array of counter variables and runs a
variable number of threads, where each thread increments it's own 4-byte
counter in the array. It does so at a number of array spacing levels to
see the performance when counters fall on the same cache lines. With a
spacing of 1 the memory is directly adjacent, and for each spacing level
it skips that many counter variables in the global array. For example,
if spacing is 4, the threads would use counter[0], counter[4], counter[8],
and so on, which uses a chunk of memory every 16 bytes. The cache_line.c
program outputs a CSV formatted table that you can use to generate some
graphs. The seconds CSV output is the same set of tests without using
the global array counters, and instead a local counter on the stack. This
is meant to provide a baseline (since those will always be on their own
cache line). The results were:


64 Byte Cache Line


128 Byte Cache Line

So what does this tell us? When spacing is one and all counter memory
(16 threads * 4 bytes == 64 bytes) is entirely on one cache line,
concurrency is poor. As we add more space between each counter variable,
we start to see performance improve (faster runtime). This is because all
thread counters are no longer on one cache line. On the 64 byte cache
line machine, we see things really level off when spacing is 16. This
is because each counter is now on it's own cache line. On the 128 byte
cache line machine, you can see it takes one more iteration of spacing
because the cache line is twice as big.

So what can we take from this? If you have any arrays or data structures
that are accessed and updated independently from different threads, make
sure they are on a different cache line. This may mean wasting a little
space, but as you can see, the concurrency performance is well worth it.

C vs C++

Linux vs FreeBSD, vi vs emacs, MySQL vs PostgreSQL, your habit or favorite
technology vs another's. At the end of the day there is no winner, just
a matter of preference for the task at hand. I learned C++ 13 years ago,
I forgot most of my C++ knowledge 10 years ago, I discouraged the use of
C++ in this period in between, and in the past year I've been re-learning
C++ (mostly due to Drizzle). So what did I use after unlearning C++ 10
years ago? I wrote everything in C (and by everything I mean this was
my performance programming language of choice). This worked quite well,
but it's an interesting evolution that I think is now coming full circle.

When I first started programming C, it was a bit clumsy, and I look back
at my old code and cringe. I began to develop a certain programming style
that can best be described as object-oriented C programming due to the
conventions used. The structs, functions that operated on those structs,
and types all were nicely separated into abstract objects. Of course
C provides little protection so a user could still poke at what they
wanted, but the idea was there and things worked well if you stayed in
your sandbox.

As projects became more complex, my C code needed things like inheritance,
polymorphism, and templates (although I did not recognize them under those
names as I do now). How do you manage such construct in C? There are a
few tricks, one being to make your first struct member be a struct of a
base class, and cast up/down when switching context. Another trick is to
abuse C macros to give you either shared class methods or polymorphism. For
example:

#define HASH_ADD(__hash, __key, __obj, __prefix) { \
  if (__hash ## _hash[__key] != NULL) \
    __hash ## _hash[__key]->__prefix ## prev= __obj; \
  __obj->__prefix ## next= __hash ## _hash[__key]; \
  __obj->__prefix ## prev= NULL; \
  __hash ## _hash[__key]= __obj; \
  __hash ## _count++; \
}

This C macro will take any object with the correct data members and treat
it like a hash table. There are a few other related macros to manage
other hash table operations. I have a similar collection for lists,
queues, and other custom objects.

Now, the seasoned C++ programmer will look at these tricks as just say
"you've created a hacked up limited version of C++ that most people will
cringe at", and I'm starting to agree. It's time to stop fighting the
inevitable. This style of C programming worked very well for me mostly
because I was one of just a small handful of programmers that ever looked
at this code (most of it never made it open source sadly). Now that I'm
working on open source projects and working with developers of all levels,
I'm starting to see the value in more common structure and code protections
to allow more people to participate easily. A fresh new C++/Java programmer
out of college can understand and extend an abstract C++ plugin class more
more easy then they can figure out my C macro/casting constructs and work
within that framework. Even some experienced C programmers have to take
a minute to figure out what I was actually doing (not saying what I was
doing was advanced or anything, just more convoluted).

C++ classes scale better than custom C preprocessor macro "languages". By
"scale" I mean ease of developer understanding and use. Ever look at the
Zend/PHP header files?

So, what does this mean? I'm starting to understand the value C++ brings,
especially in code organization and compile time guarantees. You still get
that raw performance if you are careful and test new containers/libraries
(for example, stack in the STL does not perform all that well compared
to a raw array or even vector). Lately I've been working on a few micro
(and not so micro) benchmarks seeing what the difference is between C
and C++. There is a small price to pay if you start going crazy with
inheritance because each new layer adds some constructor/destructor
overhead. I avoided this in C by having large objects with everything
(and just used casting or macros). Real inheritance makes code much more
readable and extensible though.

Three years ago I never would have thought I'd be writing such a blog
post, but here I am. This new decision will start trickling into projects
I already work in, and new projects will be in C++ if appropriate. Some
things will remain in C, but any modular server or application development
will probably get switched. I am going to be extremely cautious about
performance and will do my best to ensure the move to C++ is not too
costly.

Are you offended by this? I'd love to hear your thoughts. Also, think
of it this way: at least I didn't say I'm starting to use Java. :)
(Sorry to you Java programmers, I'm just trying to stir up trouble)

Threads with Events

Last week I was surprised to see this
paper
bubble back up on Planet MySQL. It
describes the pros and cons of thread and event based programming for high
concurrency applications (like a web server), arguing that thread-based
programming is superior if you use an appropriate lightweight threading
implementation. I don't entirely disagree with this, but the problem is
such a library does not exist that is standard, portable, and useful
for all types of applications. We have POSIX threads in the portable
Linux/Unix/BSD world, so we need to work with this. Other experimental
libraries based on lightweight threads or "fibers" are really interesting
as they can maintain your stack without all the normal overhead, but it
is hard to get the scheduling correct for all application types. I would
even argue that thread and event based programming is actually not all
that different, it's just a matter of how state is maintained (stack vs
state variables) and how scheduling is performed.

The comparisons done in that paper also put a C-based web server using
a co-routine threading library against a Java based server that depends
on the poll() system call. I'm sorry, but this is comparing apples to
oranges. First, you're in the Java VM with a number of runtime components
(like garbage collection) which may be getting in the way. Also,
the standard poll() system call is not an efficient event-handling
mechanism, it's much better to use epoll or some other Kernel-based
handling mechanism.

One high-concurrency userland threading implementation I do like is in
Erlang. Erlang processes are extremely lightweight and I've written
apps that depend heavily on them. One interesting application I saw
was caching objects where each object got it's own Erlang process. This
put a whole new spin on cache management, and it looked like it could
actually scale reasonably well. The "problem" with Erlang, which may
or may not be a problem depending on your requirements, is that it is
still a bit of overhead running byte-code in a VM, as well as it being
a functional language. I love functional programming, but I've found it
still ties most developer's heads in knots if they don't have a reason to
use it regularly. For open source projects trying to build a contributor
community, it can act as one more hurdle.

So, what is the "best" paradigm?

Back in 2000 some colleagues and I wrote a hybrid thread-event library
that would create one event-handler instance per thread, and connections
would be spread across the pool of event-handling threads. I believe
this gave the best of both worlds, and I saw high throughputs with
fairly minimal overhead. I wrote a number of servers based on this
architecture, including HTTP, IMAP, POP3, and DNS, and with each server
type this model proved to be efficient and scalable. Ultimately the best
architecture depends on your application. If you never intend to have
many connections, and your applications has long-running computations,
one-thread-per-connection would probably be best. If you need to handle
large numbers of connections and have short, non-blocking request
processing, event-based scales extremely well. You can of course create
a hybrid of these two and have all connections managed by event threads
and asynchronous queues to dedicated processing threads for heavy request
handling (this is sort of what I did in the C Gearman Job Server).

There is no single correct answer, so take a look at your options before
deciding how to approach your own applications. Don't be afraid to create
hybrids as well. Regardless of which paradigm you choose, concurrent
programming can be hard, especially at the lower levels. There have been a
number of higher level abstractions to help developers, from new libraries
to new languages, but most of these come with a cost in performance or
flexibility. When you need to squeeze every bit of performance out of
your application, you will most likely end up in C or C++ dealing with
these issues directly.

This is actually one of the problems I'm attempting to address with the Scale Stack Event modules. I'm trying
to create a healthy level of abstraction on hybrid thread/event based
applications so you don't have any overhead or limitations while a lot
of the common headaches are taken care of for you. If you have a need
for such a system, get in touch, I'd be interested to talk. Since it is
BSD licensed you can use it in any application, including commercial.

XML feed