LexisNexis open sources code for Hadoop alternative

Wikileaks Cable Offers New Insights Into Oracle-Sun Deal | PCWorld Business Center

A Stick Figure Guide to the Advanced Encryption Standard (AES)

On Password Strength

XKCD (as usual) makes a very good point, this time about password strength, and I reckon it’s something app developers need to consider urgently. Geeks can debate the exact amount of entropy, but that’s not really the issue: insisting on mixed upper/lower case and/or non-alphabetic and/or numerical components in a user password does not really improve security, and definitely makes life more difficult for users. So the authors of functions that do an “is this a strong password” check should seriously reconsider their approach, particularly if the app uses that check to decide whether to accept the password as “good enough” at all.
Update: Jeff Preshing has written an xkcd password generator. Users probably should choose their own four words, but it’s a nice example, and a similar method could be used by an app to give “password suggestions” that are still safe.
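As a rough illustration of the “password suggestions” idea, here is a minimal shell sketch. It is not Jeff Preshing’s generator; the function names and the word list file are assumptions made up for this example, and it relies on GNU shuf (with its -r option) and awk being available. It picks words uniformly at random from a list you supply, and estimates the entropy of such a passphrase.

```shell
# Sketch only: both function names are hypothetical, and the word list
# (one word per line) has to be supplied by the caller.

# Print COUNT words (default 4) picked at random from WORDLIST.
suggest_passphrase() {
    wordlist=$1
    count=${2:-4}
    # -r picks with replacement, so every word is an independent draw;
    # /dev/urandom is used as the randomness source.
    shuf -r -n "$count" --random-source=/dev/urandom "$wordlist" | paste -sd ' ' -
}

# Entropy in bits of COUNT independent, uniform picks from POOL options.
entropy_bits() {
    awk -v pool="$1" -v count="$2" \
        'BEGIN { printf "%.1f\n", count * log(pool) / log(2) }'
}

# Example (assuming a 2048-word list in words.txt):
#   suggest_passphrase words.txt 4
#   entropy_bits 2048 4      # prints 44.0
```

The entropy figure only holds if the words really are picked at random from the list; four words a user chooses personally will usually be more predictable.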

HDlatency – now with quick option

I’ve done a minor update to the hdlatency tool (get it from Launchpad): it now has a --quick option that makes it run its tests with 16KB blocks only, rather than a whole range of sizes. This is much quicker, and 16KB is the InnoDB page size, so it’s the most relevant size for MySQL/MariaDB deployments. However, I didn’t just remove the other sizes, because they can be very helpful in tracking down problems and putting misconceptions to rest. On SANs (and local RAID of course) you have things like block sizes and stripe sizes, and opinions on what might be faster. Interestingly, the real world doesn’t always agree with the opinions.
As Mark Callaghan correctly pointed out when I first published it, hdlatency does not provide anything new in terms of functionality; the db IO tests of sysbench cover it all. A key advantage of hdlatency is that it doesn’t have any dependencies: it’s a small single piece of C code that will compile and run on very minimalistic environments. We often don’t control the base environment we have to work on, which is why hdlatency was initially written. It’s just a quick little tool that does the job.
We find hdlatency particularly useful for comparing environments, primarily at the same client. For instance, the client might consider moving from one storage solution to another; in that case it’s useful to know whether we can expect an actual performance benefit. The burst data rate (big sequential read or write) that often gets quoted for a SAN or even an individual disk is of little interest for database use, since a database’s key performance bottleneck lies in random access I/O: the disk head(s) will need to move. So it’s important to get some real, relevant numbers rather than just go with magic vendor numbers that are not really relevant to you. Also, you can have a fast storage system attached via a slow interface, and consequently the performance will not be at all what you’d want to see. It can be quite bad.
To get an absolute baseline on what are sane numbers, also run hdlatency on a local desktop HD. This may seem odd, but you might well encounter storage systems that show lower performance than that. ‘Nuf said. If you’re willing to share, I’d be quite interested in seeing some (--quick) output data from you; just make sure you say what storage it is: type of interface, etc. Simply drop it in a comment to this post, so it can benefit more people. Thanks!

Slides from DrupalDownUnder2011 on Tuning for Drupal

By popular request, here’s the PDF of the slides of this talk as presented in January 2011 in Brisbane; it’s fairly self-explanatory. Note that it’s not really extensive “tuning”; it just fixes up a few things that are usually “wrong” in default installs, creating a more sane baseline. If you want to get to optimal correctness and more performance, other things do need to be done as well.

Open Query, new on Fifth Ave

Some of you already know since you helped us move, we recently shifted Open Query’s main office to Fifth Avenue, next door to Elizabeth’s. The new place is comfortable, I really like it so far. Anna is also happy with her new admin space and cat Figaro has found an empty spot on a bookshelf to stretch out on! The lease costs are a bit steep, as is common these days… chances are we’ll just buy our next place.
Follow-up: yes, this was an April 1st post. But everything in the above post is the truth; it’s just phrased to be very open to a bit of misinterpretation ;-) I find that the real world provides plenty of fun and unbelievable yet true tidbits, so why bother making up nonsense!

MySQL data backup: going beyond mysqldump

A user on a Linux user group mailing list asked about this, and I was one of the people replying. Re-posting here as I reckon it’s of wider interest.
> [...] tens of gigs of data in MySQL databases.
> Some in memory tables, some MyISAM, a fair bit InnoDB. According to my
> understanding, when one doesn’t have several hours to take a DB
> offline and do dbbackup, there was/is ibbackup from InnoBase.. but now
> that MySQL and InnoBase have both been ‘Oracle Enterprised’, said
> product is now restricted to MySQL Enterprise customers..
>
> Some quick searching has suggested Percona XtraBackup as a potential
> FOSS alternative.
> What backup techniques do people employ around these parts for backups
> of large mixed MySQL data sets where downtime *must* be minimised?
>
> Has your backup plan ever been put to the test?
You should put it to the test regularly, not just when it’s needed. An untested backup is not really a backup, I think.
At Open Query we tend to use dual master setups with MMM, other replication slaves, mysqldump, and XtraBackup or LVM snapshots. It’s not just about having backups, but also about general resilience, maintenance options, and scalability. I’ll clarify:
  • XtraBackup and LVM give you physical backups. That’s nice if you want to recover or clone a complete instance as-is. But if anything is wrong in the source, it’ll all be stuffed in the copy too (you can sometimes recover InnoDB tablespaces and there are tools for it, but time may not be on your side). Note that LVM cannot snapshot multiple volumes consistently, so if you have your InnoDB ibdata/IBD files and iblog files on separate spindles, using LVM is not suitable.
  • mysqldump gives you logical (SQL) backups. Most if not all setups should have this. Even if the file(s) were to be corrupted, they’re still readable since it’s plain SQL. You can do partial restores, which is handy in some cases. It’ll be slower to load, so having *only* an SQL dump of a larger dataset is not a good idea.
  • Some of the above backups can and should *also* be copied off-site. That’s for extra safety, but in terms of recovery speed it may not be optimal and should not be relied upon.
  • Dual masters are for easier maintenance without scheduled outages, as well as resilience when for instance hardware breaks (and it does).
  • Slaves. You can even delay a slave (Maatkit has a tool for this), which gives you a live correct image even in case of a user error, provided you get to it in time. Also, you want enough slack in your infra to be able to initialise a new slave off an existing one. Scaling up at a time when high load is already occurring can become painful if your infra is not prepared for it.
A key issue to consider is this: if the dataset is sufficiently large, and the online requirements high enough, you can’t afford to just have backups. Why? Because how quickly can you deploy new suitable hardware, install the OS, do the restore, validate, and put it back online? In many cases one or more of those steps simply take too long, so my summary would be “then you don’t really have a backup”. Clients tend to argue with me on that, but only fairly briefly, until they see the point: if a restore takes longer than you can afford, that backup mechanism is unsuitable.
So, we use a combination of tools and approaches depending on needs, but in general terms we aim to keep the overall environment online (individual machines can and will fail! Relying on a magic box or SAN to not fail *will* get you bitten) to vastly reduce the instances where an actual restore is required. Into that picture also comes using separate test/staging servers, so developers don’t stuff around on live servers (human error is an important cause of hassles).
In our training modules, we’ve combined the backups, recovery and replication topics, as it’s clearly all intertwined and overlapping. Discussing backup techniques separately from replication and dual master setups makes no sense to us. It all needs to be put in place with an overall vision. Note that a SAN is not a backup strategy. And neither is replication on its own.
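To make the “can you afford the restore time” question concrete, here is a back-of-the-envelope sketch in shell. The function name and all the numbers are hypothetical; plug in the load rate you actually measured during a test restore, and the time you actually need for hardware, OS install and validation.

```shell
# Hypothetical helper: does an estimated restore fit in the downtime budget?
# Arguments: dataset size (GB), sustained load rate (MB/s),
#            fixed overhead in minutes (hardware, OS install, validation),
#            allowed downtime in minutes.
# Prints the estimate; returns 0 if it fits the budget, 1 if it does not.
restore_fits() {
    size_gb=$1
    rate_mb_s=$2
    overhead_min=$3
    budget_min=$4
    # Minutes needed to load the data, rounded up (integer arithmetic).
    load_min=$(( (size_gb * 1024 + rate_mb_s * 60 - 1) / (rate_mb_s * 60) ))
    total_min=$(( load_min + overhead_min ))
    echo "estimated restore: ${total_min} min (budget: ${budget_min} min)"
    [ "$total_min" -le "$budget_min" ]
}

# Example with made-up numbers: 50 GB loading at 20 MB/s, 2 hours of
# overhead, and 1 hour of acceptable downtime:
#   restore_fits 50 20 120 60 || echo "this backup alone is not enough"
```

With those example numbers the data load alone takes about 43 minutes, and the fixed overhead pushes the total far past the budget, which is exactly the situation where replication and dual masters have to carry the load rather than a restore.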

Importing a file dumped from MySQL with mysqldump into drizzle

As big fans of new technology, we try to keep up to date with what’s happening in the industry. As such, I decided to start using drizzle on my development machine since they announced GA this week.
First exercise: import a file dumped from a MySQL server I don’t have access to into drizzle. Normally, you can use drizzledump on the mysql server and make it dump a drizzle-compatible file. That was not possible in this case, so I decided to sed my way through the various errors. Not pretty, and I hope that at some point we’ll have a tool that can convert a mysqldump into a drizzle-compatible file, but it works for now.
Here’s what I had to do. Note that this is by no means complete and comes with no guarantees; it’s just a starting point.
# This file started by setting a SQL_MODE. That doesn't exist in 
# drizzle, so we comment it out
sed -i "s/^SET SQL_MODE/#SET SQL_MODE/g" mysqldump.sql 

# The create database statement set a default character set. 
# Everything in drizzle is UTF8, so let's lose it!
sed -i "s/DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci//g" mysqldump.sql 

# The table definitions mentioned a default character set. 
# Everything in drizzle is UTF8, so let's lose it!
sed -i 's/DEFAULT CHARSET=utf8//g' mysqldump.sql 

# No MyISAM except for temporary tables, so away with it.
sed -i 's/ENGINE=MyISAM//g' mysqldump.sql 

# Invalid timestamps are not accepted in drizzle, so this should be a null 
# value. Since some of the columns in this file are actually NOT NULL defined, 
# for now I just set those dates to 1970. UGLY, but works for me. Don't do this 
# on anything that will ever go anywhere near production though!
sed -i "s/'0000-00-00/'1970-01-01/g" mysqldump.sql 

# tinyint doesn't exist anymore, so just replace with integer. Note that you'll 
# have to do this for all data types that no longer exist in drizzle. The
# pattern matches digits only, so a greedy .* can't swallow the rest of the line.
sed -i "s/tinyint([0-9]*)/integer/g" mysqldump.sql
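If you’d rather not edit the dump in place, the same substitutions can be sketched as a single sed pass that reads from stdin and writes to stdout. The function name is made up for this example, the tinyint pattern is restricted to digits so it cannot run past the closing bracket, and like the commands above it only covers what I ran into in this one file.

```shell
# Hypothetical wrapper: all of the substitutions above in one sed pass.
# Reads a mysqldump file on stdin and writes the converted SQL to stdout,
# leaving the original dump untouched.
mysqldump_to_drizzle() {
    sed \
        -e 's/^SET SQL_MODE/#SET SQL_MODE/' \
        -e 's/DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci//g' \
        -e 's/DEFAULT CHARSET=utf8//g' \
        -e 's/ENGINE=MyISAM//g' \
        -e "s/'0000-00-00/'1970-01-01/g" \
        -e 's/tinyint([0-9]*)/integer/g'
}

# Usage:
#   mysqldump_to_drizzle < mysqldump.sql > drizzle.sql
```

Keeping the original file around makes it easy to diff the two and see exactly what was changed before loading anything.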
Hope this helps others!

PayPal & decisions on acceptable use

As you may know, Open Query uses PayPal for some of its financial transactions. I filed the following question with PayPal support. Note that for this question I regard it as completely immaterial whether one supports the actions of Julian Assange or WikiLeaks or not. Naturally companies have the right to choose which clients to serve, but in this case they did cite a specific clause as the reason for cancellation and I don’t see how it applies. Also, PayPal is close to an effective monopoly in its sphere of operation, and that too comes with consequences and responsibilities. Anyway, the letter is below; naturally I’ll also post PayPal’s response.
===
As a business client of PayPal, I would like to inquire what PayPal’s decision making process is regarding violation of its “Acceptable Use” Policy. I refer specifically to the published (https://www.thepaypalblog.com/2010/12/paypal-statement-regarding-wikileaks/) permanent restriction of the WikiLeaks account, quoting “… payment service cannot be used for any activities that encourage, promote, facilitate or instruct others to engage in illegal activity.”
Considering that at this point in time no charges have been laid against WikiLeaks for any of its activities, in any country, and not even a subpoena has seen the light of day, what basis did PayPal use for this decision? Surely if there is illegal activity, this is typically indicated by a criminal conviction or at least a subpoena related to a prosecution. I find it extremely worrying, and would appreciate some more clarity as this may affect our continued use of your service. Naturally companies have the right to choose which clients to serve, but in this case PayPal did cite a specific clause as the reason for cancellation and I don’t see how it applies.
Obviously, we would prefer to not conduct business through an organisation which may at any point cancel our service for essentially arbitrary reasons which may include political disagreement, lobbying by third parties, or other forms of pressure. It would be great to narrow down that list of possibilities, ideally to 0. The law exists for a reason, it protects us all. Companies can’t go about playing judge&jury.
Awaiting your response,
Regards, Arjen Lentz
Exec.Director, Open Query
===
PayPal’s initial response
Thank you for contacting PayPal in regards to our acceptable use policy and our decision made on the account for WikiLeaks. For security reasons we cannot discuss any details of a PayPal account with a third party. The status of a PayPal account can only be discussed with the account holder to ensure that sensitive account information is not disclosed. For more information regarding WikiLeaks donations, we advise you to contact the organization called Wau Holland used to raise funds for WikiLeaks. To learn more about the Acceptable Use Policy, please refer to our Help Centre and the Legal Agreements section on the PayPal website.
===
My reply
Thank you for your reply. The documents you’re referring to explicitly do not answer my question, which is why I asked. I am inquiring about other publicly available information, from you, and asking for clarification: PayPal made a public statement about cancelling Wau Holland’s PayPal service, quoting a specific sentence from the acceptable use policy. It’s at https://www.thepaypalblog.com/2010/12/paypal-statement-regarding-wikileaks/
So, both the fact that an account was cancelled and some aspect of the reasoning were made public by you. All I am asking for is clarification of your own public statement, as it affects me as a PayPal client. The sentence you refer to in the publication is “… payment service cannot be used for any activities that encourage, promote, facilitate or instruct others to engage in illegal activity.” and such matters are, by nature, a matter of public record also. Therefore, it should be no problem at all for you to simply tell me what illegal activity you are referring to. Did they promote, facilitate or instruct? Which one is it? And which illegal activity?
As you’ll be aware, what is “illegal” is defined by law; if the law has been broken, charges can be filed. Then of course there is a presumption of innocence, but if a case is proven and a conviction is made, then naturally action is taken. No charges have been filed against Wau Holland, thus PayPal appears to have no legal basis for its action; it amounts to an arbitrary choice, not one actually based on the stated Acceptable Use clause.
As I mentioned in my original question, of course a company has the right to choose its customers; that’s why it has terms, conditions, acceptable use policies, etc. It makes it clear to would-be-customers what can be expected. If you put in a clause “we reserve the right to cancel any client’s service at any point at our discretion” then indeed you can do what you like, and clients know what to expect. It’s clear.
However, that’s not what you did. In this case your actions don’t appear to jibe with your own published policies, and your initial response to me does not inspire any more confidence. The issue in a way has nothing to do with WikiLeaks or Wau Holland and whether their actions are likable or not. It is about what this appears to mean for PayPal clients: if a malicious third party contacts PayPal, or PayPal itself “feels like it”, my account may be suspended even if I did not break any stated PayPal policy. As a person, and entrepreneur, this worries me greatly. It becomes a matter of PayPal being predictable and trustworthy as a business partner. I think that is quite worthy of a more constructive and comprehensive response from the side of PayPal.
Thanks,
Regards, Arjen.
===
Their reply (9 Dec)
Thanks for contacting PayPal. I appreciate the opportunity to assist you with your questions.
(yes that was the entire reply)
===
My reply (9 Dec)
Yes? I too would appreciate that [assisting me with my questions], but you haven’t so far. The above text was the only bit in your email, not actually addressing my questions. Looking forward to your proper reply. Thanks.
===
No further correspondence was received. On the same day though, this appeared in the press: Caving to pressure from supporters, PayPal releases WikiLeaks’ funds
Well, that’s something, but it doesn’t actually address the questions I raised, which were of a generic business nature and not restricted to the WikiLeaks issue. The issue is that PayPal appears to be an unpredictable business partner, not adhering to its own terms of service when (for whatever reason) it sees fit. That is a serious problem.