Planet
Percona XtraDB Cluster 5.5.20 – Beta release
I am happy to announce the availability of beta release of our new product Percona XtraDB Cluster.
Percona XtraDB Cluster is High Availability and Scalability solution for MySQL Users and the beta release is based on Percona Server 5.5.20 and the recently released Galera 2.0 GA
The main focus in this release:
- Incremental State Transfer, especially useful for WAN deployments
- Support of Percona XtraBackup for Snapshot State Transfer. This feature is available in pre-release series for XtraBackup 2.0
Despite Galera 2.0 has status “production ready”, we still name our release “beta”, as
we want to give an additional attention on an integration of Galera with Percona Server features and
on using XtraBackup for SST. Please test it if you are going to try XtraDB Cluster.
Percona XtraDB Cluster provides:
- Synchronous replication. Transaction either commited on all nodes or none.
- Multi-master replication. You can write to any node.
- Parallel applying events on slave. Real “parallel replication”.
- Automatic node provisioning.
- Data consistency. No more unsyncronised slaves.
Percona XtraDB Cluster is fully compatible with MySQL or Percona Server in the following meaning:
- Data compatibility. Percona XtraDB Cluster works with databases created in MySQL / Percona Server
- Application compatibility. There is no or minimal application changes required to start work with Percona XtraDB Cluster
Please note, this is a beta release, not suitable for production yet. Bugs and some rough edges are expected. We are encouraging you to try and play with it in testing environment and report us your feedback and experience.
We expect to have a production-ready release by the end of March 2012.
Links:
-
We provide tar.gz and RPM binaries for RedHat (CentOS, Oracle Linux) 5 and 6, and Debian packages.
Downloads: http://www.percona.com/downloads/Percona-XtraDB-Cluster/ -
Documentation
Codership Wiki - General Discussion group
- Launchpad project
- Bug reports
Previous posts on this topic:
Threads with Events
Last week I was surprised to see this
paper bubble back up on Planet MySQL. It
describes the pros and cons of thread and event based programming for high
concurrency applications (like a web server), arguing that thread-based
programming is superior if you use an appropriate lightweight threading
implementation. I don't entirely disagree with this, but the problem is
such a library does not exist that is standard, portable, and useful
for all types of applications. We have POSIX threads in the portable
Linux/Unix/BSD world, so we need to work with this. Other experimental
libraries based on lightweight threads or "fibers" are really interesting
as they can maintain your stack without all the normal overhead, but it
is hard to get the scheduling correct for all application types. I would
even argue that thread and event based programming is actually not all
that different, it's just a matter of how state is maintained (stack vs
state variables) and how scheduling is performed.
The comparisons done in that paper also put a C-based web server using
a co-routine threading library against a Java based server that depends
on the poll() system call. I'm sorry, but this is comparing apples to
oranges. First, you're in the Java VM with a number of runtime components
(like garbage collection) which may be getting in the way. Also,
the standard poll() system call is not an efficient event-handling
mechanism, it's much better to use epoll or some other Kernel-based
handling mechanism.
One high-concurrency userland threading implementation I do like is in
Erlang. Erlang processes are extremely lightweight and I've written
apps that depend heavily on them. One interesting application I saw
was caching objects where each object got it's own Erlang process. This
put a whole new spin on cache management, and it looked like it could
actually scale reasonably well. The "problem" with Erlang, which may
or may not be a problem depending on your requirements, is that it is
still a bit of overhead running byte-code in a VM, as well as it being
a functional language. I love functional programming, but I've found it
still ties most developer's heads in knots if they don't have a reason to
use it regularly. For open source projects trying to build a contributor
community, it can act as one more hurdle.
So, what is the "best" paradigm?
Back in 2000 some colleagues and I wrote a hybrid thread-event library
that would create one event-handler instance per thread, and connections
would be spread across the pool of event-handling threads. I believe
this gave the best of both worlds, and I saw high throughputs with
fairly minimal overhead. I wrote a number of servers based on this
architecture, including HTTP, IMAP, POP3, and DNS, and with each server
type this model proved to be efficient and scalable. Ultimately the best
architecture depends on your application. If you never intend to have
many connections, and your applications has long-running computations,
one-thread-per-connection would probably be best. If you need to handle
large numbers of connections and have short, non-blocking request
processing, event-based scales extremely well. You can of course create
a hybrid of these two and have all connections managed by event threads
and asynchronous queues to dedicated processing threads for heavy request
handling (this is sort of what I did in the C Gearman Job Server).
There is no single correct answer, so take a look at your options before
deciding how to approach your own applications. Don't be afraid to create
hybrids as well. Regardless of which paradigm you choose, concurrent
programming can be hard, especially at the lower levels. There have been a
number of higher level abstractions to help developers, from new libraries
to new languages, but most of these come with a cost in performance or
flexibility. When you need to squeeze every bit of performance out of
your application, you will most likely end up in C or C++ dealing with
these issues directly.
This is actually one of the problems I'm attempting to address with the Scale Stack Event modules. I'm trying
to create a healthy level of abstraction on hybrid thread/event based
applications so you don't have any overhead or limitations while a lot
of the common headaches are taken care of for you. If you have a need
for such a system, get in touch, I'd be interested to talk. Since it is
BSD licensed you can use it in any application, including commercial.
Scale Stack vs node.js vs Twisted vs Eventlet
We've been discussing switching from Tornado to either
Twisted
or Eventlet for Nova (the compute project for
OpenStack), so I decided
to setup a test to see if there are performance differences to
take into consideration. While I was at it I decided to include node.js since that's all the rage these
days, as well as Scale Stack, a C++
project I started earlier this year.
The Test
I wanted to check for two main factors: handling of large numbers of
concurrent connections and the overhead with transferring large amounts of
data. To do this I wrote a simple echo server in each framework and then
used the Scale Stack echo flood tool to test each one. The tool allows
you to specify the number of concurrent connections and how much data to
send and verify in 32k chunks. You can find the echo server and flood tool
for Scale Stack in the project
source code. For each of the others, here is the echo server source:
node.js
var net = require('net'); net.createServer(function (socket) { socket.on("data", function (data) { socket.write(data); }); socket.on("end", function () { socket.end(); }); }, {backlog: 32768}).listen(12345, "localhost");
Twisted
from twisted.internet.protocol import Protocol, Factory from twisted.internet import epollreactor epollreactor.install() from twisted.internet import reactor class Echo(Protocol): def dataReceived(self, data): self.transport.write(data) factory = Factory() factory.protocol = Echo reactor.listenTCP(12345, factory, backlog=32768) reactor.run()
Eventlet
import eventlet def handle(fd): while True: c = fd.recv(16384) if not c: break fd.sendall(c) server = eventlet.listen(('0.0.0.0', 12345), backlog=32768) pool = eventlet.GreenPool(size=32768) count = 0 while True: new_sock, address = server.accept() pool.spawn_n(handle, new_sock)
Setup
Since none of the frameworks run multi-core for this test (although Scale
Stack could), I decided to use my laptop which is a 2.4ghz Core 2 Duo with
4GB of memory running Ubuntu 10.4. There will be one core for the server,
and one for the client. Doing the test on a single machine also lets us
cut network bottlenecks out of the picture since it all runs through
the local interface. In order to test at the high connection counts,
I needed to tweak some system limits. I allow for 64k file descriptors
per process in /etc/security/limits:
root soft nofile 65535 root hard nofile 65535 * soft nofile 65535 * hard nofile 65535
You'll notice really high listen backlog settings for the echo server
code above. The kernel limits need to match this as well so we need to
set these new limits in /proc. I also increased the ephemeral port range
so we can get up to 32k active client connections and reduced the kernel
socket buffer sizes so I don't out of memory. These can be set with:
echo 32768 > /proc/sys/net/core/netdev_max_backlog echo 32768 > /proc/sys/net/core/somaxconn echo "21000 61000" > /proc/sys/net/ipv4/ip_local_port_range echo 8192 > /proc/sys/net/core/rmem_default echo 8192 > /proc/sys/net/core/wmem_default
With the system limits set, I started running the flood tool with
connection counts from 1 through 32k. For each connection count, I ran
the test with the connection echoing 32k of data and 512k of data. I ran
each test three times for each server and took the lowest time (times
were very consistent across the board, so any sample would have done).
Results

Graph of the result listed below.
| 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 | |
| Scale Stack 32k | .05 | .05 | .05 | .05 | .06 | .06 | .07 | .10 | .14 | .20 | .32 | .52 | .91 | 1.72 | 3.49 | 6.83 |
| node.js 32k | .05 | .05 | .05 | .05 | .06 | .06 | .08 | .10 | .14 | .20 | .35 | .67 | 1.30 | 2.43 | 5.14 | 10.11 |
| Twisted 32k | .05 | .05 | .05 | .06 | .06 | .06 | .08 | .10 | .14 | .24 | .39 | .73 | 1.43 | 2.71 | 5.33 | 10.54 |
| Eventlet 32k | .05 | .05 | .05 | .05 | .06 | .06 | .07 | .10 | .14 | .20 | .32 | .62 | 1.15 | 2.26 | 4.60 | 9.30 |
| Scale Stack 512k | .05 | .05 | .06 | .06 | .08 | .10 | .16 | .25 | .45 | .82 | 1.59 | 3.13 | 6.34 | 12.56 | 24.97 | 50.25 |
| node.js 512k | .05 | .08 | .10 | .12 | .16 | .22 | .29 | .49 | .85 | 1.51 | 3.10 | 6.32 | 12.61 | 26.75 | 58.56 | 117.04 |
| Twisted 512k | .05 | .06 | .06 | .07 | .10 | .13 | .22 | .43 | .82 | 1.63 | 3.24 | 6.22 | 11.48 | 22.72 | 44.48 | 89.46 |
| Eventlet 512k | .06 | .06 | .06 | .07 | .10 | .12 | .21 | .35 | .69 | 1.29 | 2.61 | 4.91 | 9.68 | 19.55 | 38.62 | 77.80 |
After the above tests, I also started each server up one at a time and
ran a 32k connection client that sent data indefinitely to saturate the
process. Here are the vmstat numbers of my system during these tests:
| Context Switches | User % | System % | Idle % | Client Delay (s) | |
| Scale Stack | 700 | 7 | 93 | 0 | 6.30 |
| node.js | 9k | 31 | 40 | 29 | 7.07 |
| Twisted | 13k | 27 | 45 | 28 | 8.93 |
| Eventlet | 24k | 28 | 49 | 23 | 5.15 |
In all cases the server process was consuming an entire core. The idle
times were on the core running the client tool, since the server could not
always keep up with the client load. The last column labeled "Client Delay"
was another time test I ran while the server was saturated to measure
response time. For this test, a client would connect, send 32k of data,
wait for the echo response, and then disconnect. Results are in seconds
for this test.
Conclusions
I was very impressed with how node.js and the Python frameworks held
up. I've been writing event-driven servers in C/C++ for the past decade
or so and didn't think the higher level languages could handle this kind
of load as well as they did. My only concern with node.js or Python
is not being able to use all the cores on your system. Some services
are well suited to run multiple server process on a single machine or
to farm work out to worker process pools to utilize all your cores, so
this will be less of an issue. Other services are best implemented when
all connections are in a single process and use thread pools instead. For
that you'll still need to rely on a C or C++ based server (Scale Stack is
meant to be a framework like the others to help in these cases). Servers
written in Erlang or Java would probably perform decently across multiple
cores as well.
For short lived connections transferring less than 32k of data, all
frameworks scaled very well. When a larger amount of data was being sent
we started to see some differentiation. This could be due to buffering
techniques or simply the overhead of calling into the language handlers
more often. The increase in user % in the processor utilization from
the vmstat output for node.js and Python supports this. Scale Stack
only buffers once on read and has less runtime overhead since it is not
running in an interpretor. The node.js and Python servers may be able
to be optimized to avoid double buffering if that is indeed happening,
please let me know if that is the case.
As far as the original question of Twisted vs Eventlet, I don't think
performance will be much of a deciding factor. Eventlet has a slight boost
in performance and claims to be easier to write services in, but other
folks still swear by Twisted. It is probably safe to say that available
framework features and personal preference will be the deciding factors.
Update - August 6, 2010
I decided to run a few more versions for just the 32k connection, 512k
data test. Below are the repeated times for the original four, plus Erlang,
regular Python threads, and two versions of Go.
- Scale Stack 50.25
- node.js 117.04
- Twisted 89.46
- Eventlet 77.80
- Erlang 61.65
- Python threads 111.04 (lots of memory even with minimal stack size)
- Go v1 62.95
- Go v2 59.73
The Go version is very impressive, almost as fast as the C++ version. Of
course these last four you get SMP without any extra work, which is a
bonus. It turns out the default socket buffer sizes in Erlang are only
1500 bytes (MTU size). So be sure to push these up (in this test I set
it to 16k). Memory consumption with the Erlang server was also fairly low
(peak around 400M, usually around 150M).



