Baron Schwartz

Stay curious!

What are your favorite MySQL bug reports?

Tue, 24/08/2010 - 09:00

Bug reports can be fun. They can also be terrible. Either way they can be entertaining. On the Drizzle IRC channel today I saw a couple of references to MySQL bug reports: “it is stop working” and “Does not make toast” (which reminds me of the Mozilla bug report about the kitchen sink). Got any other favourites[1]?

[1] This one’s for Jay.

Related posts:

  1. What are your favorite MySQL replication filtering rules?
  2. My new favorite comic: The Adventures of Ace, DBA
  3. What is your favorite database design book?
  4. What are your favorite PostgreSQL performance resources?
  5. My favorite wiki is Dokuwiki

Free webinar on MySQL performance this Thursday

Tue, 24/08/2010 - 02:35

ODTUG invited me to give a webinar and I said yes, so this Thursday you’re invited to join me as I talk about MySQL performance. We’ve come a very long way towards a MySQL that can perform well on modern hardware, and there really isn’t broad recognition of this. A lot of the best work has gone into the InnoDB “plugin” storage engine, which was announced after my co-authors and I sent High Performance MySQL to the press. I will explain what you should be doing differently now compared with two years ago, and suggest a performance-in-a-nutshell configuration baseline for MySQL that’s quite different from what I’d have said in 2008. You can register for free through GoToWebinar. See you there.

Related posts:

  1. Get a free sample chapter of High Performance MySQL Second Edition
  2. MySQL: Free Software but not Open Source
  3. Speaking at NovaRUG on Thursday
  4. High Performance MySQL is going to press, again
  5. Coming soon: High Performance MySQL, Second Edition

Oracle is improving MySQL

Wed, 18/08/2010 - 22:50

I’ve noticed that a steady and perhaps even growing number of bug reports and feature requests are getting resolved for the next milestone release. I continue to see signs that Oracle’s next release of MySQL will not only include much of the unreleased good work that’s been done over the last few years, but will add a lot of new features and fixes as well.

Related posts:

  1. MySQL Enterprise/Community split could be renewed under Oracle
  2. 50 things to know before migrating Oracle to MySQL
  3. Migrating US Government applications from Oracle to MySQL
  4. A review of Forecasting Oracle Performance by Craig Shallahamer
  5. A review of Optimizing Oracle Performance by Cary Millsap

Two subtle bugs in OUTER JOIN queries

Tue, 03/08/2010 - 12:38

OUTER JOIN queries in SQL are susceptible to two very subtle bugs that I’ve observed a number of times in the real world. Daniel and I have been hammering out ways to automatically detect queries that suffer from these bugs, in a relatively new Maatkit tool called mk-query-advisor. It’s part of our series of advisor tools for MySQL. I wrote a blog post about it a while ago. Automated analysis of bad query patterns is a good job for a tool, because catching buggy queries is hard work if you do it manually.

Let’s dive right in and analyze these subtle bugs. Warning: if you don’t understand how SQL handles NULL, you’re not going to understand the following. Many people have a hard time with NULL, which is why these bugs are so hard to understand and avoid. This is one reason why SQL is a hard language to use properly.

Bug 1: a column could be NULL for two reasons, and you can’t distinguish them

If the right-hand table of a LEFT JOIN (R in the examples below) contains NULL-able columns, and you add a WHERE clause that keeps only the rows where such a column is NULL, you’re going to get bugs, because a row of the left-hand table with no match in R also comes back with every R column set to NULL. Here’s an example. Let’s start with a plain outer join query:

select * from L left join R on l_id = r_id;
+------+------+---------+
| l_id | r_id | r_other |
+------+------+---------+
|    1 |    1 |       5 |
|    2 |    2 |    NULL |
|    3 | NULL |    NULL |
+------+------+---------+

Here we see that R has no row matching l_id = 3, so the R columns of the third row are filled in with NULLs, and that one matching row (the middle row) has a NULL r_other column. Now, let’s add a WHERE clause:

select * from L left join R on l_id = r_id where r_other is null;
+------+------+---------+
| l_id | r_id | r_other |
+------+------+---------+
|    2 |    2 |    NULL |
|    3 | NULL |    NULL |
+------+------+---------+

This query is buggy, because the two rows are returned for completely different reasons, and you can’t tell which is which from r_other alone. IS NULL clauses can safely be placed on the columns used in the JOIN clause, but not on other columns in the right-hand table that might be NULL.
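To make the two cases concrete, here is a minimal sketch that rebuilds the example in an in-memory SQLite database (an assumption made so the example is runnable anywhere; the post itself uses MySQL) and labels each returned row by testing the join column r_id. The table contents are reconstructed from the output shown above, and the function names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create L and R with the rows implied by the output in the post."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE L (l_id INTEGER);
        CREATE TABLE R (r_id INTEGER, r_other INTEGER);
        INSERT INTO L VALUES (1), (2), (3);
        INSERT INTO R VALUES (1, 5), (2, NULL);
    """)
    return conn


def null_rows_with_reason(conn):
    """Return the rows where r_other IS NULL, plus why each one is NULL.

    r_id is the join column, so r_id IS NULL can only mean "no match in R".
    """
    sql = """
        SELECT l_id, r_id, r_other,
               CASE WHEN r_id IS NULL THEN 'no match in R'
                    ELSE 'matched, r_other is NULL' END AS reason
        FROM L LEFT JOIN R ON l_id = r_id
        WHERE r_other IS NULL
        ORDER BY l_id
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in null_rows_with_reason(build_demo_db()):
        print(row)
```

Both rows still come back, but the extra column shows that they are NULL for different reasons.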

Bug 2: an OUTER JOIN is converted to INNER

If you place a non-null-safe comparison operator in the WHERE clause on any column in the right-hand table that isn’t part of the JOIN clause, you implicitly disable the outer-ness of the query and convert it to an INNER JOIN. Here’s an example:

select * from L left join R on l_id = r_id where r_other > 1;
+------+------+---------+
| l_id | r_id | r_other |
+------+------+---------+
|    1 |    1 |       5 |
+------+------+---------+

The left-outer-ness of the first query I showed you above is what causes its third row to be output. The greater-than operator in this example automatically makes the left-ness impossible, because any time there’s a row in the left-hand table that has no match in the right-hand table, its R columns will be filled in with NULLs, and those NULL-filled rows will be eliminated by the operator. So the effect is that only matching rows will ever be output.
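The conversion can be checked directly. The sketch below again assumes an in-memory SQLite database standing in for MySQL, with table contents reconstructed from the output above and hypothetical function names. It compares the WHERE version with a plain INNER JOIN, and also shows one common alternative that is not from the post: moving the comparison into the ON clause, which keeps every row of L. Whether that is the result you want depends on the intent of the query.

```python
import sqlite3


def build_join_db():
    """Same demo tables as in the post: L has ids 1..3, R matches 1 and 2."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE L (l_id INTEGER);
        CREATE TABLE R (r_id INTEGER, r_other INTEGER);
        INSERT INTO L VALUES (1), (2), (3);
        INSERT INTO R VALUES (1, 5), (2, NULL);
    """)
    return conn


def left_join_filtered_in_where(conn):
    # The comparison rejects NULLs, so NULL-filled rows never survive.
    return conn.execute(
        "SELECT * FROM L LEFT JOIN R ON l_id = r_id "
        "WHERE r_other > 1 ORDER BY l_id").fetchall()


def inner_join_filtered_in_where(conn):
    return conn.execute(
        "SELECT * FROM L INNER JOIN R ON l_id = r_id "
        "WHERE r_other > 1 ORDER BY l_id").fetchall()


def left_join_filtered_in_on(conn):
    # The comparison only decides which R rows match; L rows are all kept.
    return conn.execute(
        "SELECT * FROM L LEFT JOIN R ON l_id = r_id AND r_other > 1 "
        "ORDER BY l_id").fetchall()


if __name__ == "__main__":
    conn = build_join_db()
    print(left_join_filtered_in_where(conn))
    print(inner_join_filtered_in_where(conn))
    print(left_join_filtered_in_on(conn))
```

The first two queries return the same single row, which is the sense in which the LEFT JOIN has silently become an INNER JOIN.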

If you want to ponder variations and subtleties of the above, you can read more discussion on the issue report where we’re hammering out the details of automatically detecting and warning about these sneaky errors.

Related posts:

  1. How to simulate FULL OUTER JOIN in MySQL
  2. How to write a SQL exclusion join
  3. How to write SQL JOIN clauses more compactly
  4. The dangerous subtleties of LEFT JOIN and COUNT() in SQL
  5. How to write INSERT IF NOT EXISTS queries in standard SQL

Speaking at MySQL Meetup in Northern Virginia

Wed, 21/07/2010 - 23:34

The closest thing I know of to a “Northern Virginia MySQL Meetup” is the Sterling Database Data Solutions Group. I got in touch with the organizer and we scheduled a meeting next Wednesday July 28th. I’ll be presenting, and so will someone from Fusion-IO, a solid-state storage vendor. This is on short notice, so tell your friends about it! It would be great to grow a strong monthly meetup presence in this area.

Here’s the abstract I sent: “This talk covers best practices to help you get the most out of MySQL performance. It assumes you know a database well, though it need not be MySQL. We’ll cover several angles of the topic. Configuration is usually the first thing people ask about. Although it’s possible to misconfigure MySQL and get bad performance, the configuration options you need for good performance are few and rather simple. We’ll see how to inspect MySQL’s performance and status, also a fairly simple subject. Next is query tuning. There are a few surprises in MySQL because its query execution engine is simpler than Oracle’s or SQL Server’s. We’ll see how to avoid those surprises and work with the query optimizer. Finally, we’ll focus on what you should know if you are considering migrating part or all of your application from Oracle. There will be plenty of time for questions, so bring yours!”

Related posts:

  1. I’ll be speaking at the O’Reilly MySQL Conference 2010
  2. Speaking at NovaRUG on Thursday
  3. Speaking at Surge 2010
  4. Speaking at EdUI Conference 2009
  5. Speaking at CPOSC 2009

Speaking at Surge 2010

Tue, 20/07/2010 - 01:47

OmniTI’s Surge conference is looking really good — and I’m going to be speaking there. The CfP just closed, so the list of speakers is still growing, but it already includes impressive names such as Neil J. Gunther. So far, this speaker list has zero fluff, and reminds me of the Percona Performance Conference. I’ll be talking about how not to shard your systems. Sharding is no fun and it’s costly. If you don’t have to do it — and many applications don’t need to, with orders-of-magnitude performance improvements in MySQL — you should not.

Related posts:

  1. I’ll be speaking at the O’Reilly MySQL Conference 2010
  2. Videos and slides for MySQL Conference 2010
  3. Speaking at EdUI Conference 2009
  4. Speaking about Maatkit at CPOSC
  5. Learn about mk-query-digest at PgEast 2010

Aspersa’s mysql-summary tool

Sun, 11/07/2010 - 05:46

For those of you who miss what Maatkit’s mk-audit tool (now retired) gave you, there’s a pair of tools in Aspersa that more than replace it. I wrote previously about the summary tool. I don’t think I have mentioned the mysql-summary tool. It has been under development for a while, and at this point it has quite a lot of functionality. You can see a sample of the output on its wiki page.

Related posts:

  1. Aspersa’s summary tool supports Adaptec and MegaRAID controllers
  2. Aspersa, a new opensource toolkit
  3. Using Aspersa to capture diagnostic data
  4. MySQL Toolkit’s Show Grants tool 0.9.1 released
  5. Introducing MySQL Toolkit’s Show Grants tool

A review of Guerrilla Capacity Planning by Neil Gunther

Wed, 07/07/2010 - 11:54


Guerrilla Capacity Planning. By Neil J. Gunther, Springer 2007. Page count: about 200 pages, plus appendixes. (Here’s a link to the publisher’s site.)

Of all the books I’ve reviewed, this one has taken me the longest to study first. That’s because there is a lot of math involved, and Neil Gunther knows a lot more about it than I do. Here’s the short version: I’m learning how to use this in the real world, but that’s going to take many months, probably years. I’ve already spent about 10 months studying this book, and have read it all the way through twice — parts of it five times or more. Needless to say, if I didn’t think this was a book with value, I wouldn’t be doing that. But you’ll only get out of this book what you put in. If you want to learn a wholly new way to understand software and hardware scalability, and how to do capacity planning as a result, then buy the book and set aside some study time. But don’t think you’re going to breeze through this book and end up with a simple N-step method to take capacity forecasts to your boss. If you want that, buy John Allspaw’s book instead. (If you’re reading this blog post, you need that book.)

I don’t want to spend a lot of time talking about Neil’s method, because honestly the book isn’t about the method first and foremost, and I think many readers will have a hard time digging the capacity planning method out of the math-ness. This book is, in a sense, a textbook or workbook for his training courses. It begins with a lot of general topics, such as how managers think about capacity, risk, what’s needed in the world of businesses that are driven by Wall Street, ITIL, and so on. Then there’s the mathematical background for the rest of the book, things like significant digits and expressing error.

The part of the book that I’m still studying begins in Chapter 4, which introduces ways to quantify scalability. The math begins with Amdahl’s Law, which you may have heard of. It turns out that not only can this be used to understand how much overall speedup is possible by speeding up part of a process, which is how I’m used to using it, but it can be used to model what is possible with parallelization. (I think I actually learned this in my university classes, but I’d forgotten its uses in parallel computing since then.) Anyway, it’s a straightforward model that makes intuitive sense and is easy to accept. I believe in it because it’s so logical and simple, and because I’ve worked with it for a long time. That’s the last bit of math in this book that I can understand so solidly, because after that, we get into a lot of things that have to do with interaction between concurrently performed work, and nothing is ever intuitive about that domain.
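As a reminder of how small that model is, here is a minimal sketch of Amdahl’s Law in its usual textbook form. The function and argument names are hypothetical; the formula is the standard one, not copied from the book. The same expression covers both uses mentioned above: speeding up part of a process by some factor, or spreading the parallelizable part over that many processors.

```python
def amdahl_speedup(fraction_improved, factor):
    """Overall speedup when `fraction_improved` of the work runs `factor` times faster.

    For parallelization, `fraction_improved` is the parallelizable share of
    the work and `factor` is the number of processors.
    """
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    serial_part = 1.0 - fraction_improved
    return 1.0 / (serial_part + fraction_improved / factor)


if __name__ == "__main__":
    # 90% of the work parallelizes; the serial 10% caps the speedup below 10x.
    for processors in (1, 2, 10, 100, 1000):
        print(processors, round(amdahl_speedup(0.9, processors), 2))
```

However many processors are added, the speedup never exceeds 1 / (1 - fraction_improved), which is why the serial part dominates.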

Now, when you’re talking about scalability, you are generally working with the scalability of concurrent systems, and queueing theory is Topic Number One. Queueing is correctly modeled, under certain very restricted conditions, by the Erlang C formula. This is a complex bit of math, and although I believe in it, I don’t understand it well enough to know how it’s derived or proven to be correct. Well, there’s no Erlang C math in this book. Neil Gunther goes in a completely different direction. Instead of modeling the impact of queueing through the math that describes the model, he creates a new model. Let’s leave the model for later, and just look at what’s nice about not using Erlang C math to model computer system scalability:

  • The Erlang C formula requires complex calculations.
  • It is valid only in restricted conditions, and it’s a lot of work to prove that your workload conforms.
  • It models queueing delay, but it doesn’t model coherency delay.
  • It requires inputs such as service time, which are difficult or impossible to measure accurately.
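For a sense of the calculation the first bullet refers to, here is a minimal sketch of the textbook Erlang C formula, which gives the probability that an arriving request has to queue. It is not taken from the book (which, as noted, avoids this math), and the function and parameter names are hypothetical. The offered load is the arrival rate multiplied by the mean service time, which is exactly the kind of input the last bullet says is hard to measure.

```python
from math import factorial


def erlang_c(servers, offered_load):
    """Probability that an arrival must wait in an M/M/c queue.

    servers:      number of identical servers (c)
    offered_load: arrival rate * mean service time, in Erlangs (A)
    The formula only applies to a stable queue, i.e. A < c.
    """
    if servers < 1:
        raise ValueError("need at least one server")
    if not 0 <= offered_load < servers:
        raise ValueError("offered_load must satisfy 0 <= A < servers")
    # Term for "all servers busy", scaled by c / (c - A).
    busy = (offered_load ** servers / factorial(servers)
            * servers / (servers - offered_load))
    # Terms for 0 .. c-1 requests in the system.
    idle = sum(offered_load ** k / factorial(k) for k in range(servers))
    return busy / (idle + busy)


if __name__ == "__main__":
    for c in (1, 2, 4, 8):
        # Keep each server 50% utilized while adding servers.
        print(c, erlang_c(c, 0.5 * c))
```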

Someone once said that all models are wrong, but some models are useful. Neil Gunther heads in the direction of a more useful model. First, he proves that two parameters are necessary and sufficient to create a realistic model. Next, he introduces another parameter into Amdahl’s Law to account for coherency delay. The resulting (still simple) equation models serial delay (the reverse of parallel speedup) and coherency delay. Now we have a model for how a system scales under a given workload as you increasingly parallelize the hardware. This is the universal scalability model. From the mathematical point of view, it’s the crowning achievement of this book. I’m very much summarizing, by the way. There’s a lot to think about in developing such a model, so the reader gets quite a tour de force here. Along the way Neil shows how you can arrive at the same surprising result through an entirely different route, without even using Amdahl’s Law as a starting point.

There are other models. Neil discusses these. They all have problems. Some don’t model what we know can happen in the real world — retrograde scaling — where performance can decrease when you add more power to a system. Others are physically impossible, predicting negative speedup. Negative speedup means the system’s performance goes below zero. As in, you ask it to do work, and it, uh, takes back work it’s already done? Impossible. So it certainly looks like Neil’s model is the strongest contender. By the way, Craig Shallahamer’s book on forecasting Oracle performance uses the universal scalability model, although without the mathematical rigor.
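The universal scalability model summarized above is usually written as C(N) = N / (1 + sigma (N - 1) + kappa N (N - 1)), where C(N) is capacity relative to a single processor. Here is a minimal sketch under that assumption; the parameter names sigma (serial or contention delay) and kappa (coherency delay) are labels chosen for this example. With kappa set to zero the expression reduces to Amdahl’s Law, and with kappa above zero it has a peak, after which adding processors reduces capacity: retrograde scaling.

```python
import math


def usl_capacity(n, sigma, kappa):
    """Relative capacity at load n (n = 1 gives exactly 1.0)."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_load(sigma, kappa):
    """Load at which capacity is highest (requires kappa > 0).

    Obtained by setting the derivative of usl_capacity with respect to n
    to zero, which gives n = sqrt((1 - sigma) / kappa).
    """
    if kappa <= 0:
        raise ValueError("the curve has no peak unless kappa > 0")
    return math.sqrt((1 - sigma) / kappa)


if __name__ == "__main__":
    sigma, kappa = 0.02, 0.0005  # made-up illustrative values
    for n in (1, 8, 16, 32, 44, 64, 100):
        print(n, round(usl_capacity(n, sigma, kappa), 2))
    print("peak near n =", round(usl_peak_load(sigma, kappa), 1))
```

Unlike the models that predict negative speedup, this curve stays positive for every n of at least 1, because the denominator never drops below 1 when both parameters are non-negative.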

Now, the problem is how to apply this in the real world. To model a system’s performance, you have to know the value of those two magical parameters. How on earth can you find these values? This seems to be just as hard as Erlang C math. But Neil shows the second most remarkable thing: if you transform the universal scalability model around a bit, then you get a polynomial of degree two. This is exciting because if you take some measurements of your system’s observed performance at different points on the scalability curve (holding the work per processor constant, and adding more processors), and then transform those measurements in an equivalent manner, you can fit a regression curve through those points. Now you can reverse the transformations to the equation, plug in the coefficients of the quadratic equation that resulted from the curve-fitting, and out come the parameters you need for the universal scalability equation! Final result: you can extrapolate out beyond your observations and predict the system’s scalability.
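The fitting procedure can be sketched in a few lines. This assumes the usual form of the model, C(N) = N / (1 + sigma (N - 1) + kappa N (N - 1)), and the transformation x = N - 1, y = N / C(N) - 1, which turns it into the quadratic y = kappa x^2 + (sigma + kappa) x with no constant term. The function name, the need for a measurement at N = 1 to normalize against, and the hand-rolled least-squares solve are choices made for this example, not details taken from the book. The measurements in the demo are synthetic, generated from known parameters so the recovery can be checked.

```python
def fit_usl(loads, throughputs):
    """Estimate (sigma, kappa) from throughput measured at several loads.

    loads:       processor counts (or concurrency levels); must include 1
    throughputs: measured throughput at each load, in any consistent unit
    """
    base = dict(zip(loads, throughputs))[1]  # throughput at N = 1
    sx2 = sx3 = sx4 = sxy = sx2y = 0.0
    for n, throughput in zip(loads, throughputs):
        capacity = throughput / base  # relative capacity C(N)
        x = n - 1
        y = n / capacity - 1
        sx2 += x ** 2
        sx3 += x ** 3
        sx4 += x ** 4
        sxy += x * y
        sx2y += x ** 2 * y
    # Least squares for y = a*x^2 + b*x (through the origin), via Cramer's rule.
    det = sx4 * sx2 - sx3 * sx3
    if det == 0:
        raise ValueError("need measurements at two or more loads above 1")
    a = (sx2y * sx2 - sx3 * sxy) / det
    b = (sx4 * sxy - sx3 * sx2y) / det
    kappa = a
    sigma = b - a
    return sigma, kappa


if __name__ == "__main__":
    # Synthetic data from sigma=0.05, kappa=0.001 and 100 units/sec at N=1.
    loads = [1, 2, 4, 8, 16, 32]
    data = [100 * n / (1 + 0.05 * (n - 1) + 0.001 * n * (n - 1)) for n in loads]
    print(fit_usl(loads, data))
```

With real measurements the points will not sit exactly on the curve, so the recovered parameters are estimates, and extrapolating far beyond the measured range deserves some caution.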

We’re not done. All of this was about hardware scalability: “how much faster will this system run if I add more CPUs?” Software scalability is next. Neil goes back to the basics, starting with how Amdahl’s Law applies to software speedup, and essentially covers all the same ground we’ve already covered, but this time modeling what happens when you hold the hardware constant, and increase the concurrency of the workload the software is serving. It turns out that exactly the same scalability model holds for software as it did for hardware. This is why he calls it the universal scalability model. But not only that, it works for multi-tier architectures of arbitrary complexity.

And this is why I say I am not competent to really prove or disprove the validity of the whole thing. It makes sense to me that even a multi-tier architecture can conform to a model with two parameters. As we know in the real world, there is usually a single worst bottleneck, a weakest link. And therefore no matter how complex the architecture, the dominant factors limiting scalability are still coherency and/or queueing at the bottleneck, and how much you can parallelize (Amdahl’s Law). Thus, the universal scalability model intuitively might be valid for such architectures. But proving it — wow, that’s way beyond me. I know my limits. I’m taking it all on faith, experience, and intuition at this point.

In my mind, the results Neil Gunther derives up to this point in the book would have been plenty. However, there’s lots more left in the book. The rest of the book is about how to use the model for capacity planning, but surprisingly, it’s not about just how to use the universal scalability model. It’s about Guerrilla Capacity Planning in the real world. Right after exploring software scalability, he dives into virtualization for a whole chapter — and then shows you how to measure, model, and predict the scalability of various virtualization technologies. Next chapter: web site capacity planning. After that? “Gargantuan Computing: GRIDs and P2P.” Yep, he analyzes the scalability limits of Gnutella and friends. And then, apparently just because he can, he dissects arguments about network traffic in general (read: “how scalable is the Internet?”). I can’t pretend to understand all this myself. I’m just following along.

I have a feeling that Neil Gunther is kind of like Einstein: his real gift is his ability to create thought experiments that make the model accessible to mortals. Maybe someday he’ll be a legend you learn about in CS101 classes, or maybe someday he’ll be proven wrong like Newton, who knows. In the meantime, I’m going to keep working on applying it all in the real world, especially to MySQL, and see what comes of it. The fact that I’m still doing that bears out what I said earlier: you aren’t going to just waltz through this book and come away with a clear picture of how to work through a capacity planning method. You’ll have some work to do. If you want an elegant and simple capacity planning method, then you should buy John Allspaw’s The Art Of Capacity Planning instead.

Related posts:

  1. A review of The Art of Capacity Planning by John Allspaw
  2. A review of Forecasting Oracle Performance by Craig Shallahamer
  3. A review of SQL and Relational Theory by C. J. Date
  4. A review of Optimizing Oracle Performance by Cary Millsap
  5. A Review of Beginning Database Design by Clare Churcher

A review of Cloud Application Architectures by George Reese

Mon, 05/07/2010 - 12:36


Cloud Application Architectures. By George Reese, O’Reilly 2009. (Here’s a link to the publisher’s site).

This is a great book on how to build apps in the cloud! I was happy to see how much depth it went into. It’s short — 150 pages plus some appendixes — so I was expecting it to be a superficial overview. But it isn’t. It is thorough. And it is also obviously built on his own experience building very specific applications that he uses to run his business — he isn’t preaching about stuff he doesn’t know first-hand. Finally, George Reese is a good writer! It’s impressive. This is how he covers so much ground with so much depth in so few pages, and it all makes sense. He takes a side trip every now and then, but it’s always in the right place at the right time — how to do a snapshot for backups, for example — and isn’t distracting. For a technical book, it has an amazing narrative flow.

The book begins with an intro to cloud computing in general, with definitions and an explanation of different models, plus cost estimates of traditional IT, managed hosting, and cloud computing for an app. There’s a brief overview of the Amazon platform. This book is mostly about Amazon, and states that up front. There are references and comparisons to other providers throughout, and later there’ll be two appendixes on GoGrid and Rackspace, each written by a representative of that company. I was happy that the author brought in people to write those, instead of doing it himself. They are non-promotional in nature, and quite short. That adds value to the book, which would have been fine without them, honestly.

Back to chapter two now — a deeper introduction to Amazon, moving through all the major components, but especially EC2, S3, and EBS. Here we also start to see a focus on the platform as a whole — availability zones, security, redundancy, reliability. These topics are treated fairly and woven into every chapter. It’s clear that the author doesn’t want to isolate these topics, but rather explain them in context so your mind is always on them as each new topic is introduced. Chapter 3 picks all this up again: considering a move into the cloud? More cost comparisons, more explanations of concepts such as availability and how they translate into the Amazon cloud. Performance, disaster recovery and a few other topics show up here.

Chapter 4 is about how to build an app in the cloud: web app design, making multiple machines work together, handling failure, building AMIs, privacy, and operating databases (especially MySQL) in the cloud. The privacy section is particularly good. I’d recommend this to anyone building an app that might process personally identifiable information or financial information, in or out of the cloud. And as I said already, this is one of the types of things he weaves into the whole book. Chapter 5 picks right up and keeps going: it’s about security. Data security, regulatory compliance, network security, host security, how to respond if there’s a breach. And then Chapter 6 is on disaster recovery: planning, implementing, managing.

Chapter 7 is titled “scaling,” but it’s more than that. It starts with capacity planning. Here’s one of my favorite quotes: “some think they no longer need to engage in capacity planning… [others] think of tens or hundreds of thousands of dollars in consulting fees. Both thoughts are dangerous myths…” There’s a reference to John Allspaw’s excellent book on capacity planning. (I saw that he was a tech reviewer for this book, too.) This chapter covers how you predict and provision for capacity needs in the cloud, including the “automatic scaling” holy grail, how it can bite you, and how to keep that from happening. It also talks about how you scale vertically in the cloud. It doesn’t talk about why it’s hard to really be sure about your capacity needs in the cloud, but that’s okay given the other material covered in the chapter.

And that’s it! After this, there are 3 appendixes. One is an AWS reference, and then there are the two on GoGrid and Rackspace.

What’s to criticize? Well, not a lot really. I read every word in this book, I promise. Here’s what I noticed: he talked about database corruption from unexpected shutdowns — he should have said “use InnoDB,” because that’s pretty much a MyISAM problem. He talked about taking backups from replication slaves — he should have said “don’t just trust replication, verify it with mk-table-checksum.” I also think he encourages a little too much trust that cloud providers are always magically going to have the capacity you need; it felt a bit naive, but this is actually a fundamental point in whether you’re going to use the cloud or not. Nobody knows how much excess capacity Amazon has, and as we know, weird things happen. But if you’re going to embrace a cloud platform, you’re going to have to trust to a certain extent.

A couple of other things to nitpick: in Chapter 1, when talking about availability, he writes “[if] even 1 minute of downtime in a year is entirely unacceptable, you almost certainly want to opt for a managed services environment… [if] 99.995% is good enough, you can’t beat the cloud.” But these numbers are unrealistic and don’t have enough context to explain what he means. Finally, in a couple of places he talks about algorithms for generating unique identifiers and dealing with concurrent access, but these aren’t explained deeply enough to prevent novices from shooting themselves in the foot with wrong assumptions, such as the assumption that a timestamp will always increase between successive accesses. But a savvy developer will recognize those problems and won’t be bitten.
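For context on the two availability figures quoted above, here is a small sketch that converts between an availability percentage and minutes of downtime per year. It assumes a 365-day year and counts all downtime equally; the function names are hypothetical, and this is simple arithmetic rather than anything from the book.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year allowed by an availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def availability_percent(downtime_minutes):
    """Availability percentage implied by a yearly downtime budget."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    print(downtime_minutes_per_year(99.995))
    print(availability_percent(1))
```

On these assumptions, 99.995% leaves a budget of roughly 26 minutes a year, while allowing at most 1 minute a year means better than 99.9998% availability.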

This book is the first one to go onto my list of essential books in a while. I’ll be keeping this one on my own bookshelf.

Related posts:

  1. Review of Scalable Internet Architectures by Theo Schlossnagle
  2. A review of The Art of Capacity Planning by John Allspaw
  3. Under-provisioning: the curse of the cloud
  4. A review of Pentaho Solutions by Roland Bouman and Jos van Dongen
  5. A review of Guerrilla Capacity Planning by Neil Gunther

A review of Web Operations by John Allspaw and Jesse Robbins

Sun, 04/07/2010 - 12:02


Web Operations. By John Allspaw and Jesse Robbins, O’Reilly 2010, with a chapter by myself. (Here’s a link to the publisher’s site).

I wrote a chapter for this book, and it’s now on shelves in bookstores near you. I got my dead-tree copy today and read everyone else’s contributions to it. It’s a good book. A group effort such as this one is necessarily going to have some differences in style and even overlapping content, but overall it works very well. It includes chapters from some really smart people, some of whom I was not previously familiar with. John and Jesse obviously have good connections. A lot of the folks are from Flickr.

Here are the highlights in my opinion.

  • Theo Schlossnagle, who has a place on my list of essential books, opens things with an overview of what web operations really is, and why it’s hard. Don’t skip this. Theo’s introduction is concise and thoughtful.
  • Eric Ries discusses the benefits of continuous deployment. He is right on the money. Right out of college I spent 3 years as a developer at a company with very little engineering discipline, and then left for another company built by a small ace team practicing extreme programming. Eric nails the benefits of continuous deployment — he really gets it. I hadn’t heard of Eric before, but now I’ve subscribed to his blog.
  • John Allspaw (whose book on capacity planning is also on my list of essentials) and Richard Cook discuss how complex systems fail. This chapter appeared in part as a whitepaper and blog post on John’s blog, and is expanded in this book. I have spent a lot of time examining failures for clients, and as VP of Consulting, also a lot of time examining Percona’s own mistakes. I fully agree with the conclusions in this chapter. A few key points: there is never a single root cause; our desire to find one blinds us and keeps us from learning; true failures are inherently unpredictable and happen only when a series of things fails; avoiding failure requires experience with failure. This echoes another book I’ve read recently, The Black Swan.
  • Brian Moon’s chapter on unexpected traffic spikes. If you get a chance to hear Brian speak, take it. He’s an engaging guy with interesting and relevant stories to tell. Stories are always a better experience than bullet points.
  • Jake Loomis’s chapter on postmortems. My own research into prevention of emergencies agrees almost perfectly with his list of things to do on page 225. Read this chapter carefully! Now, knowing how to put this into action is hard — very hard — but at least you’ll have a place to start. The worst compliment I ever got after fixing a system that’d run out of hard drive space (due to utter lack of basic monitoring) was that I’d “saved the day.” Baloney. Postmortems can be a great way to learn your infrastructure’s weaknesses and prevent emergencies in the future. I’m fully confident that this particular client will again deploy new servers without adding them into Nagios, and the results will be predictable.
  • Naturally, my chapter about choosing a relational database architecture for web applications (skewed towards MySQL). There is a chapter on NoSQL databases by Eric Florenzano as well, but it is more introductory.

What wasn’t so good? I didn’t get a lot of value out of John’s interview with Heather Champ, on community management and web operations. I did not think the interview format worked well in a book full of essays. But that might just be me. Also, a couple of places in two or three chapters felt a bit rant-ish without a lot of clear actionable advice; I think readers won’t get so much out of this.

Overall, though, this is a great book, badly needed, on a topic that is simply not yet recognized for its true importance. As Theo writes, we’re seeing the emergence of web operations as a very large profession; it’s one whose definition is not yet formalized or agreed-upon, but that’ll change. It’s too important not to. Jesse’s introduction repeats this sentiment: the world now relies on the web, and so the world relies also on the engineers who make it run. Web operations is work that matters.

Related posts:

  1. My chapter in the forthcoming Web Operations book
  2. Review of Scalable Internet Architectures by Theo Schlossnagle
  3. A review of Cloud Application Architectures by George Reese
  4. A review of Understanding MySQL Internals by Sasha Pachev
  5. A review of Optimizing Oracle Performance by Cary Millsap

How I keep track of notes

Sun, 04/07/2010 - 06:58

This is the follow-up to my post on how I keep track of tasks. It’s important for me to have a good system for keeping notes and other files organized. The problem usually turns out to be that I want them organized several different ways simultaneously: by date, by project, by person, by subject. Alas, if I keep them in files on a hard drive, I can only choose one such organizing strategy, because filesystems are a single hierarchy.

I choose to organize by date, simply because most of the time I need access to notes and files about things I’m working on now or recently. If I need to find files by project or subject, there’s a search feature in my file browser, and it works really well! So date-organization is good enough for me.

Under etc in my home directory, I have a directory per year, and inside that, a directory per month. If I write a note today, it goes into the $HOME/etc/2010/07/ directory, and its filename starts with the day of the month (03). That’s the simple organizing principle behind my note system. It also lets me eventually move things off my computer into permanent storage, so I don’t have to keep backing things up forever and carrying around infinite amounts of data. I keep the last couple of years; if I need access to notes or projects from 2006, I can go pull a hard drive off the shelf and pop it into my hard drive dock (buy one of those, and you’ll never get ripped off again by external drives with their own enclosures and power supplies).

I still need a quick way to create files and place them there, or move them there after I create them. For creating files, I use Vim. There is nothing better than a plain-text editor for me. My Vim settings are such that if I begin a line with a hyphen, Vim keeps nice indentation for me, making it easy to take notes in bulleted lists with proper indentation. If you’re on a call with me and you hear typing, I’m probably taking notes into Vim.

But it’s a pain to type out the full path to the file including the year, month, and day. So I created some helper scripts and put them into my $PATH. The most important are ‘t’ and ‘c’. ‘t’ simply uses Vim to edit a file. (It also creates any required directories, based on today’s date.) So if I am on a call with Joe, I just type ‘t joe’ into a terminal, and I’m editing /home/baron/etc/2010/07/03-joe.txt.
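The scripts themselves aren't shown in this post. As a rough sketch of what ‘t’ might do, here is a Python version. The function names, the .txt extension as a fixed suffix, and the default root of $HOME/etc are assumptions taken from the example path above, not the actual script.

```python
import datetime
import os
from pathlib import Path


def note_path(name, today=None, root=None):
    """Return ROOT/YYYY/MM/DD-name.txt for the given date (default: today)."""
    today = today or datetime.date.today()
    # Assumed default root: $HOME/etc, as in the example path in the post.
    root = Path(root) if root is not None else Path.home() / "etc"
    return root / f"{today:%Y}" / f"{today:%m}" / f"{today:%d}-{name}.txt"


def edit_note(name):
    """Create the year/month directories if needed, then open the note in Vim."""
    path = note_path(name)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Replaces the current process with Vim; nothing after this line runs.
    os.execvp("vim", ["vim", str(path)])
```

Calling edit_note("joe") on 3 July 2010 would open the 03-joe.txt file under 2010/07/. A ‘c’ equivalent could reuse note_path and print the file's contents instead of launching an editor.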

The ‘c’ tool cats the file’s contents. If I type ‘c joe’, it executes ‘cat /home/baron/etc/2010/07/03-joe.txt’. This makes it easy to grep, copy and paste, and so on.

There are a few more tools: the ‘m’ tool moves any file into the date-based hierarchy, so if I save a PDF of an order-confirmation page, for example, I can then ‘m’ it and it goes into its proper place. And I have a few tools to list files I created today, yesterday, this week, and this month.
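A sketch of how the ‘m’ tool and the "files from a given day" listing could work, again in Python. The post doesn't say how ‘m’ names the moved file, so prefixing the original name with the day of the month is an assumption made to match the note filenames; both function names are hypothetical.

```python
import datetime
import shutil
from pathlib import Path


def file_into_hierarchy(src, today=None, root=None):
    """Move src to ROOT/YYYY/MM/DD-<original name> and return the new path."""
    src = Path(src)
    today = today or datetime.date.today()
    root = Path(root) if root is not None else Path.home() / "etc"
    dest_dir = root / f"{today:%Y}" / f"{today:%m}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{today:%d}-{src.name}"  # assumed naming convention
    if dest.exists():
        # Refuse to overwrite an existing file of the same name.
        raise FileExistsError(dest)
    shutil.move(str(src), str(dest))
    return dest


def notes_for_day(today=None, root=None):
    """List the files filed under the given day, sorted by name."""
    today = today or datetime.date.today()
    root = Path(root) if root is not None else Path.home() / "etc"
    month_dir = root / f"{today:%Y}" / f"{today:%m}"
    if not month_dir.is_dir():
        return []
    return sorted(month_dir.glob(f"{today:%d}-*"))
```

Listing a week or a month is the same idea with a wider range of days. Listing by filename rather than by file timestamp is a design choice here: it keeps working after files are copied to archive drives.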

I have a very important convention: when I’m taking notes and something becomes my responsibility to follow up on, I type TODO in the notes. After the call ends, I can grep for TODO in the file and quickly transfer the item into the task system I described in the post linked above. This is how I can be confident that I’m not forgetting anything I’m supposed to do: I take notes and write things down as they happen, and then review the notes afterwards.
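Plain grep is all this takes. For consistency with a Python version of the helpers, the same check can be sketched as below. The function name and the sample notes are made up for illustration.

```python
def todo_lines(text, marker="TODO"):
    """Return (line_number, stripped_line) pairs for lines containing marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if marker in line
    ]


# Invented sample notes in the hyphen-bulleted style described above.
sample = """\
- Joe wants the slow-query report by Friday
  - TODO send Joe the report
- discussed backup schedule
- TODO ask about replication lag
"""

for number, line in todo_lines(sample):
    print(f"{number}: {line}")
```

The match is case-sensitive, so the convention only works if TODO is always typed in capitals.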

All told, this system kind of feels too simple to be a system. Everyone else seems to use complicated online gizmos named after groceries, or whizbang apps created by 37Signals, but I’ve found none of them to meet my needs, and just went back to basics. Basic is good. Basic works. Basic lets me concentrate on what I’m doing.

As I said in my previous post, part of this is based on the GTD book, which I read through a couple of times (with a year in between) and picked the parts that made sense to me. I think it’s a useful book to read, if you’re having trouble organizing yourself. I would just caution against spending all your energy getting organized — leave a little energy for actually doing your work!

Related posts:

  1. How I keep track of tasks
  2. How to track what owns a MySQL connection
  3. How to make file names cross-platform
  4. Interactive directory merging
  5. Windows XP’s built-in unzipping functionality is not trustworthy

The new hotness in open-core: InnoDB

Sat, 03/07/2010 - 03:58

There’s lots of buzz lately about the so-called “open-core” business model of Marten Mickos’s new employer. But this is nothing new. Depending on how you define it, InnoDB is “open-core,” and has been for a long time. The InnoDB Hot Backup (ibbackup) tool was always closed-source. Did anyone ever cry foul and claim that this made InnoDB itself not open-source, or accuse Innobase / Oracle of masquerading as open-source? I don’t recall that happening, although sometimes people got suspicious about the interplay between the backup tool and the storage engine. Generally, though, the people I know who use InnoDB Hot Backup have no gripes about paying for it.

What is the difference between open-source with closed-source accessories, and crippleware? I think it depends on how people define the core functionality of software. Some might say that backup is core functionality for a database; and others would point to mysqldump and say that InnoDB isn’t crippleware as long as there is some alternative.

I think InnoDB is an interesting case that illustrates what can happen when commercial and GPL play together. Part of that story is the appearance of XtraBackup, an open-source competitor to InnoDB Hot Backup. Everyone’s subject to the rules of the game, unless they restrict the “core,” which would make it non-open-source to begin with.

Related posts:

  1. Does MySQL really have an open-source business model?
  2. What does an open source sales model look like?
  3. MySQL: Free Software but not Open Source
  4. Making Maatkit more Open Source one step at a time
  5. Growth limits of open-source vis-a-vis MySQL Toolkit

Is Maatkit notable enough for Wikipedia now?

Fri, 02/07/2010 - 22:45

The Maatkit article on Wikipedia was removed some time ago, after being deemed not notable. I believe this is no longer the case. It’s hard to find a credible book published on MySQL in the last few years that doesn’t mention Maatkit, there’s quite a bit of blogging about it from MySQL experts and prominent community members, and the toolkit is certainly in wide use — it’s important enough that notable companies are supporting its development. It’s available through every major Unix-like operating system’s package repository. On Debian, it’s actually part of the mysql-client package, so if you install MySQL, you automatically get Maatkit too. I believe it’s probably the second most important set of MySQL command-line tools; the most important, of course, is the set of client applications that is included with MySQL itself.

But my opinion on this topic is beside the point. I’m the creator, and I’m biased. The Maatkit Wikipedia article should be created by independent people, not the project’s founder. If you think that Maatkit belongs on Wikipedia, I encourage you to help write that article.

Related posts:

  1. Where do you use Maatkit in real life?
  2. Get Maatkit fast from the command line
  3. Maatkit in RHEL and CentOS
  4. New Maatkit release policy
  5. Maatkit version 3329 released

How I keep track of tasks

Thu, 01/07/2010 - 12:24

I use a super-simple system for keeping track of tasks that are mine personally to manage. I use issue-tracking systems for software projects and consulting work, but there is still a bunch of work-related and personal stuff that I need to make sure I don’t forget.

The main point is not to ensure that I don’t forget, actually. It is to be able to put it out of my mind with confidence that I won’t lose it. I have a crowded mind, and the cleaner I can keep it, the better.

My system has three parts: my pockets, my notepad, and a directory on my computer.

In my front pocket I have a ballpoint pen. Currently it says Holiday Inn on the side. In my back pocket, I have a small piece of paper, usually about half the size of a standard letter-size sheet, folded small enough to fit. It might be a used envelope, or a napkin, or a piece of actual notebook paper. I write down everything that matters to me. If I hear a song on the radio and I think my wife will like it, I write down some key lyrics I can search later, such as “your arms are my castle, your heart is my sky.” I write down anything I feel guilty about not doing, or neat ideas about stuff I could do, or whatever occurs to me. The goal is to write it down and trust that it’s now permanently in the system, then clear my head.

I do much the same thing with my notebook. I tend to pick these things up at conferences. I use two or three pages a week. A small size, like the size of a paperback book, is best. Legal pads are too big. One of the best pads I ever got was an InnoDB pad. I keep one page for random whatever-comes-to-me. At the beginning of each week, I collate these items; some of them I move off to the directory on my computer, others go into a single page, grouped by importance or topic as I see fit, in the notebook. The page needs to fit everything I’ll do that week. There’s no way I can do more than a page’s worth of things in a week. Typically about half the page is carried over to next week. (I just cross things off as I complete them or move them to the clean page.) This ensures that all of the important and/or urgent things are easy for me to reference, without a bunch of other stuff intruding. I also write down things I do that aren’t in my list — if I jump in and help out on a project, for example, I’ll write that down and then cross it off. This is a good record for my weekly report.

I just came back from a conference, so there are pages and pages of thoughts stimulated from conversations, people to follow up with, thank-you notes to send, and so on. A lot of this is going to be easy to take care of: I’ll just do it if it takes only a second, or move it to my computer for later followup. After I collate and organize, I tear out the old pages, feed them into my weekly report, and throw them away. They are redundant.

In my computer’s home directory, I have a directory to hold text files. Here I hold medium-term and long-term items, things that I want to do “someday” or reference material, project notes, and so on. I name each text file by topic, and there are dozens. I keep these as simple and few as possible. There’s one for music, for example. I looked up those lyrics, and then put the artist and album into todo-music.txt in my directory. Next time I decide to order a batch of CDs, I’ll refer to this list. The names of the files aren’t scientific — I just started out with what seemed right, and changed as I saw the need to; the current files have served well for a long time, so I think it’s stable and useful. I organize the lists in two sections: the top N priority items, and everything else. They are separated by a blank line. There is no need to be fancier, I find. Most things go into the everything-else category.
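The two-section layout is simple enough to read mechanically. A minimal sketch, assuming one item per line and that the first blank line is the separator; the function name is hypothetical, and treating a file with no blank line as all-priority is a guess.

```python
def split_priorities(text):
    """Split a list file into (top_priority_items, everything_else)."""
    lines = [line.rstrip() for line in text.strip().splitlines()]
    if "" in lines:
        cut = lines.index("")  # the first blank line separates the sections
        top, rest = lines[:cut], lines[cut + 1:]
    else:
        # Assumption: with no separator, every item counts as top priority.
        top, rest = lines, []
    # Later blank lines carry no meaning, so drop them from the second section.
    return top, [line for line in rest if line]
```

Because the format is just lines and one blank line, the files stay readable and editable in Vim without any tooling at all.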

So in the end, my pockets and one page in the notebook are for capturing ideas as they come to me, another page is for what I decided to prioritize for the week, and the computer is the long-term spillover for things that need to get out of the notebook. This is a lot like levels of cache in a computer. I’m keeping the most important stuff in a compact way, easy to work with. And paper is definitely easier to work with than anything with an ON switch. I have no categories, sorting, tagging, hierarchies, or anything else like that. If it’s a single page the size of my hand, there’s no need.

This system was inspired by multiple attempts to use task lists on computers, personal organizers, the Getting Things Done system, the Seven Habits of Highly Effective People method, and many more. Name your favorite app or method; I’ve probably tried it or something like it. These days it’s pen, paper, and Vim. The system has been working for me for a couple of years with excellent results; I rarely if ever forget anything. Although I may deprioritize an item, which is effectively the same as saying I’ll never do it, I have the peace of mind that comes from knowing I have ten years or so of ideas I’ll never forget, should I ever find myself with ten years to spare and nothing to do with them.

The Getting Things Done system is very valuable in one specific way for me: capture everything and get it out of the head, to keep the head clear. I can’t overstate how important that is to me.

Next I’ll write about how I get things into the system in a way that lets me also have confidence I’m not losing track of something I’m taking on (or being asked to do).

Related posts:

  1. An easy way to run many tasks in parallel
  2. How to track what owns a MySQL connection
  3. How to add a wiki homepage, sidebar, and TOC in Google Code
  4. High Performance MySQL, Second Edition: Query Performance Optimization
  5. High Performance MySQL Second Edition Schedule

There’s a European OpenSQL Camp coming up

Tue, 29/06/2010 - 22:32

In addition to the Boston edition, there’s an OpenSQL Camp at the same time and place as FrOSCon mid-August in Germany. The call for papers is open until July 11th. As always, the conference is about all kinds of open-source databases: MySQL and PostgreSQL are only two of the obvious ones; MongoDB and Cassandra featured prominently at the last one I attended, and SQLite was well represented at the first one.

Related posts:

  1. OpenSQL Camp Boston 2010
  2. OpenSQL Camp events in 2009
  3. Recap of Portland OpenSQL Camp 2009
  4. The history of OpenSQL Camp
  5. OpenSQL Camp develops further

OpenSQL Camp Boston 2010

Fri, 25/06/2010 - 16:23

Sheeri and others are organizing another incarnation of OpenSQL Camp in October in Boston. You ought to go! It’s relevant to MySQL, PostgreSQL, SQLite, and lots of the newer generation of databases — MongoDB, Cassandra, and so on.

Related posts:

  1. There’s a European OpenSQL Camp coming up
  2. Going to OpenSQL Camp US 2009
  3. OpenSQL Camp events in 2009
  4. The history of OpenSQL Camp
  5. Recap of Portland OpenSQL Camp 2009

The little-known Maatkit man page

Sat, 19/06/2010 - 22:58

The Maatkit toolkit for MySQL has a lot of functionality that’s common across the tools. It’s not a good idea to document this in each tool’s man page, of course. So there is an overall maatkit man page. It explains concepts such as configuration file syntax. This and all the other Maatkit man pages are online.

Related posts:

  1. How PostgreSQL protects against partial page writes and data corruption
  2. Writing a book about Maatkit
  3. Learn about Maatkit at the MySQL Conference
  4. Maatkit version 1417 released
  5. Maatkit version 4334 released

Postmodern databases

Sun, 13/06/2010 - 03:29

Dr. Richard Hipp gave a talk at Southeast Linux Fest today on choosing an open-source database. He thinks that NoSQL is not a very good name for the new databases we’re seeing these days, so he proposed a new name: postmodern databases. Why postmodern?

  • The absence of objective truth
  • Queries return opinions, not facts

I thought this was the best proposal I’ve heard for an alternative to the NoSQL moniker. And this is not bashing — the absence of objective truth can actually be an enabling quality, not necessarily a drawback. There’s a lot to compliment about the new databases, and calling them NoSQL is really a disservice — like calling a car a horseless carriage.

Related posts:

  1. Observations on key-value databases
  2. On the unhelpfulness of NoSQL
  3. Why high-availability is hard with databases
  4. InnoDB is a NoSQL database
  5. How to write a good MySQL conference proposal

Southeast Linux Fest is around the corner

Thu, 03/06/2010 - 22:37

If you’re near South Carolina next weekend, consider attending Southeast Linux Fest! There’s a list of illustrious speakers including several well-known in the database world: Joshua Drake and Andrew Dunstan (PostgreSQL), D. Richard Hipp (SQLite), and yes, yours truly (MySQL), plus a MySQL name that’s new to me: Brandon Checketts. There are a ton of non-database sessions too! Check out the full speaker & session list. This was a great show last year; I highly encourage everyone to attend.

Related posts:

  1. Recap of Southeast Linux Fest 2009
  2. How Linux iostat computes its results
  3. How to find per-process I/O statistics on Linux
  4. How to monitor server load on GNU/Linux
  5. How to auto-mount removable devices in GNU/Linux