sed tricks

Over the last two days I helped a charity rebuild a MySQL server and restore a database containing a lot of longblob data. Fortunately there was a dump backup file for the database in question.

However, tables with longblob column(s) were not defined with “ROW_FORMAT=COMPRESSED”. I wanted to restore that database with row compression before inserting the data, so I needed to modify the dump SQL file first. The problem is that the file is 2.5 GB and the server has only 4 GB of memory, so editing it is a challenge. Fortunately, Linux has sed to save the day. Don’t you love open source free software?

I am a power Vi/Vim user, so I am familiar with sed and have used it in the past. But there were still a few things I had to search around for quick answers on, so I’ll record the noteworthy points here. I can’t count how many times my own past blog entries have helped me over the years, and I hope you’ll find this helpful too!

  • The -n switch is important. sed is a stream editor, and in many cases you want to suppress the default streaming of data to stdout; -n does exactly that. This was especially important in my case, because a) the file is large, and b) it contains blob data that may or may not be fit to print;
  • To see a particular line, say line a, use the p (print) command: sed -n 'ap' file
  • To see all lines between line a and b: sed -n 'a,bp' file
  • To see multiple, non-adjacent lines, say line a, e, g: sed -n 'ap;ep;gp' file
  • To edit big files, you want to make in-place changes, hence the -i switch. For example, to put in InnoDB row compression, this is the command I used: sed -i 's/CHARSET=utf8;/CHARSET=utf8 ROW_FORMAT=COMPRESSED;/' file
  • Similarly, to delete line a: sed -i 'ad' file. Range deletes, e.g. sed -i 'a,bd' file, work as well.
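To make the recipes above concrete, here is a toy walk-through on a small throwaway file (GNU sed assumed; the CREATE TABLE line is just a stand-in for the dump's real DDL):

```shell
# Build a 3-line sample file resembling a dump's DDL
f=$(mktemp)
printf '%s\n' 'line one' 'CREATE TABLE t (id INT) ENGINE=InnoDB DEFAULT CHARSET=utf8;' 'line three' > "$f"

sed -n '2p' "$f"        # print only line 2
sed -n '1,2p' "$f"      # print lines 1 through 2
sed -n '1p;3p' "$f"     # print lines 1 and 3, skipping 2

# In-place: add row compression to the table definition, then delete line 3
sed -i 's/CHARSET=utf8;/CHARSET=utf8 ROW_FORMAT=COMPRESSED;/' "$f"
sed -i '3d' "$f"
cat "$f"
```

Because -i rewrites the file in place, none of this requires loading the whole file into an editor, which is the entire point on a memory-constrained box.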

By the way, when restoring an InnoDB database with a lot of blob data, it makes a lot of sense to enable the following setting in my.cnf, if it is not enabled already. It’ll make administration much easier down the road:
innodb_file_format = Barracuda

You may also need to tweak the max_allowed_packet and innodb_log_file_size parameters for successful restore.
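For reference, a my.cnf fragment along those lines might look like this. The sizes below are illustrative only and should be tuned to your server; note that compressed row format also requires innodb_file_per_table:

```ini
[mysqld]
innodb_file_format    = Barracuda
innodb_file_per_table = 1      # ROW_FORMAT=COMPRESSED needs file-per-table
max_allowed_packet    = 64M    # large blob inserts need room in a single packet
innodb_log_file_size  = 256M   # a larger redo log helps bulk loads
```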

Something else to pay attention to: if you use

mysql -uuser -p database < dump.sql

to restore the database, the program may report the wrong line number when it runs into a loading problem. In most cases you need to search the surrounding lines to find where the problem actually is. Additionally, if you are in a hurry and want to load the data without troubleshooting loading issues, you can add the -f switch to the command above so the restore ignores the errors it encounters and jumps to the next statement.

Recommendation: Vagrant and Veewee

Note: I’ve decided not to use Veewee due to silly compatibility issues for now.

Quoting from Vagrant’s web site:

Vagrant is a tool for building and distributing virtualized development environments. By providing automated creation and provisioning of virtual machines using Oracle’s VirtualBox, Vagrant provides the tools to create and configure lightweight, reproducible, and portable virtual environments.

A complementary technology called Veewee makes building VirtualBox VMs easier by automating away a lot of manual steps. Marius Ducea has a great blog post on how to use it.

My observations:
1. According to Vagrant’s web site, it should work on Windows. I’ve tried that, with unsatisfactory results; I think it works better on Linux and Mac;
2. Vagrant 0.9 was released recently and is the version I am using. However, 0.9 broke backward compatibility, so many of the Vagrantfile examples on its web site no longer work. It does provide good enough error messages for you to fix and adjust, though;
3. Just a couple of weeks ago, Veewee was not compatible with Vagrant 0.9, but that has since been fixed. I mention this because things appear to be a bit fluid, and you may need to roll up your sleeves and tweak a few things yourself to get what you want;
4. Using Veewee to build a Windows VM is not as smooth as building Linux VMs. In fact a few of the templates do not have the Cygwin setup file that definition.rb refers to. Another drawback is that the initial VM hard disk size is too small (10 gig, I believe). Darn it, I forgot where I changed the setting to make it bigger before building it. Please share that with me if you know 🙂
5. In addition, it seems hard to build a Multi-VM environment with Windows in the mix. It looks possible, though. Once again please share that with me if you know!

Potential use cases:
1. Hacking, learning, and development. It provides a safe environment to play around, build and compile things, learn new technologies, etc. For example, I’ve used it to test and play with Tengine and Nginx. I’ve also built a Multi-VM environment where I tested MariaDB 5.3 and the new Percona Toolkit, pt-table-checksum and pt-table-sync in particular. I think Vagrant is a great tool for learning Chef, Puppet, and Hadoop;
2. Testing: testing software on different distros, and continuous integration testing for server software (web servers such as Tengine, database servers such as MariaDB and Percona Server). In fact, using a CentOS 6.2 VM, I found Percona’s 5.5 RPM packages still hadn’t resolved header file conflicts, an old problem.

Sample Multi-VM Vagrant file
It is really convenient to have a Multi-VM environment for development and testing. Vagrant can do that, but there is a shortage of good samples, especially for Vagrant 0.9. Here is a sample that worked for me. Note that:
1. The VMs in the environment can be made from the same base box. In the sample below, "debian" and "xiaosaier"'s daddy is "debian6". No joke intended;
2. Use “vagrant ssh debian”, “vagrant ssh centos”, and “vagrant ssh xiaosaier” to get to each VM. Each VM can talk to one another via the IP address defined. Therefore it can be used for good, realistic testing. After having good Chef recipe or Puppet files, the VMs can be made in a certain way, which can be used for Continuous Integration Testing.
3. The hostonly network IP addresses work very well. However, by default each VM takes the name of the box it is based on. It would be nice to be able to customize the VMs' names; let me know if you know how. do |config|

  config.vm.define :debian do |debian_config|
    debian_config.vm.box = "debian6"
    debian_config.vm.network :hostonly, ""   # the IPs here are placeholders; pick your own host-only subnet
  end

  config.vm.define :centos do |centos_config|
    centos_config.vm.box = "centos62"
    centos_config.vm.network :hostonly, ""
  end

  config.vm.define :xiaosaier do |xiaosaier_config|
    xiaosaier_config.vm.box = "debian6"
    xiaosaier_config.vm.network :hostonly, ""
  end
end
PS. Note that the combination of a colon and the letter x turns into an emoticon above, at least when viewed on the web page. Watch out!

A great way to test-drive MySQL from MariaDB, Oracle, and Percona

I was doing some research on Percona Server, and came across this great tip by Baron: if you are using Oracle’s MySQL and want to test out and learn new/improved features that are present in Percona Server, you can just stop the mysqld instance, extract Percona Server binary from its rpm/deb package or tarball file, swap the binary, and do a successful restart. You can then do a test drive, kick the tires, learn and observe to your heart’s content. Swap the original mysqld back after you are done, as necessary.

I tested it and it worked great for me. Specifically, I did the following:
1. sudo /etc/init.d/mysql(d) stop
2. sudo cp /usr/sbin/mysqld /location/mysqldFromOracleOrWhatever
3. sudo cp /perconaBinaryDirectory/mysqld /usr/sbin/mysqld
4. sudo /etc/init.d/mysql(d) start
5. Test drive
6. Swap the original mysqld back as necessary

You can tell which binary you are using by reading the product and version information upon successful connection from mysql client. You can also do “SHOW VARIABLES” and tell by noticing additional settings that are present in Percona Server.

It turns out you can do the same with MariaDB as well. In fact, I tested that over the weekend successfully. You need one extra step, though. If you simply follow the steps above, here is what you get when mysqld tries to start:

Starting MySQL. ERROR! The server quit without updating PID file (/var/lib/mysql/

And here is the relevant info in the error log:

120213 5:12:33 [ERROR] Error message file '/usr/share/mysql/english/errmsg.sys' had only 722 error messages, but it should contain at least 930 error messages.
Check that the above file is the right version for this program!
120213 5:12:33 [ERROR] Aborting

You see, Percona Server is more or less Oracle MySQL, plus Percona patches for the InnoDB storage engine plugin, known as XtraDB. MariaDB’s changes, on the other hand, are more widespread and substantial. It certainly has changes to the InnoDB storage engine plugin; after all, both MariaDB and Percona Server use the same XtraDB default storage engine. On top of that, MariaDB has additional changes to the MySQL server itself, the query optimizer and replication come to mind. So it is not surprising that MariaDB’s errmsg.sys is different. I hope that compatibility can be maintained as things move forward, which is probably not too hard to do technically. In a perfect world I wish I could combine both MariaDB’s and Percona’s strengths, but I digress.

So before we can test drive MariaDB, not only do we need to swap mysqld, we also need to swap errmsg.sys. It’s always a good idea to keep a backup before it is overwritten. After starting, if you run “SHOW VARIABLES LIKE 'optimizer_%'”, you will notice that MariaDB has a lot more optimizer knobs than Percona Server or Oracle MySQL.
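The swap-with-backup routine can be sketched like this. I am simulating it below with dummy files in a temp directory; on a real server the targets would be /usr/sbin/mysqld and /usr/share/mysql/english/errmsg.sys, and you would stop mysqld first:

```shell
set -e
demo=$(mktemp -d)

# Stand-ins: the "installed" files vs. files extracted from the MariaDB package
echo "stock mysqld"   > "$demo/mysqld"
echo "stock errmsg"   > "$demo/errmsg.sys"
echo "mariadb mysqld" > "$demo/mysqld.mariadb"
echo "mariadb errmsg" > "$demo/errmsg.sys.mariadb"

# Keep backups before overwriting, then swap both files together
cp "$demo/mysqld"     "$demo/mysqld.orig"
cp "$demo/errmsg.sys" "$demo/errmsg.sys.orig"
cp "$demo/mysqld.mariadb"     "$demo/mysqld"
cp "$demo/errmsg.sys.mariadb" "$demo/errmsg.sys"

cat "$demo/mysqld" "$demo/errmsg.sys"
```

Swapping back is just the reverse copy from the .orig backups.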

This actually gives me some ideas for benchmark testing. What will we see if we conduct the same benchmark tests on the same server, with only mysqld swapped for each set of benchmark tests?

PS. This reminds me that years back, my team wrote a VB6 application that controlled a lamination film production process. I wrote a utility called AppLauncher that checked the version of the .exe binary and did a replace if the production version was different from the one in the deployment folder. That binary swap bypassed all unnecessary installation package creation, testing, and distribution, provided there were no changes to the supporting DLL files.

Generating dimension data for dates

Most analytical and BI databases have date dimension table(s). One frequently needs to generate and populate such data. I present a solution below for such data generation, written in Python. Please use different database drivers/modules to connect to your specific database server (MySQL, SQL Server, Oracle, etc.) for data population.


1. It takes 2 parameters, start date and end date, in YYYYMMDD format, inclusive. Extensive error checking is built in, but let me know if you have comments/suggestions;

2. The script produces a Python dictionary (associative array) for each day and prints out its content;

3. The output includes dayNumber: a day’s position in a year. For example, 2011-02-01 is the 32nd day of 2011, therefore its dayNumber is 32;

4. The output includes weekNumber: a week’s position in a year. The week number in year is based on ISO standard. From documentation: the ISO year consists of 52 or 53 full weeks, where a week starts on a Monday and ends on a Sunday. The first week of an ISO year is the first (Gregorian) calendar week of a year containing a Thursday. This is called week number 1, and the ISO year of that Thursday is the same as its Gregorian year.

So, 2011-01-01 has the weekNumber 52, because it falls on a Saturday and belongs to the last week of 2010.

5. The output includes weekday information as well. 4 different variations are included:
Sunday 0, Monday 1, and so on
Sunday 1, Monday 2, and so on
Monday 0, Tuesday 1, and so on
Monday 1, Tuesday 2, and so on

6. The script requires the argparse module, which comes with Python 2.7. Python versions prior to 2.7 do not include it by default, so you need to install it separately.
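As a sanity check, the same numbering schemes can be reproduced with GNU date (assuming GNU coreutils: %j is the day number in the year, %V the ISO week number, %w the Sunday-0 weekday, and %u the ISO Monday-1 weekday):

```shell
date -d 2011-02-01 +%j   # day number in year: 032
date -d 2011-01-01 +%V   # ISO week number: 52 (belongs to the last week of 2010)
date -d 2011-01-02 +%w   # Sunday-0 weekday: 0 (a Sunday)
date -d 2011-01-02 +%u   # ISO weekday, Monday 1: 7 for Sunday
```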

import argparse, sys, time
from datetime import date, timedelta

parser = argparse.ArgumentParser(description="Generating date dimension data")
parser.add_argument('-s', '--startDate', help='Start date in YYYYMMDD format', required=True, dest='startDate')
parser.add_argument('-e', '--endDate', help='End date in YYYYMMDD format', required=True, dest='endDate')

argList = parser.parse_args()

if (((not argList.startDate.isdigit()) or (not (len(argList.startDate) == 8))) or ((not argList.endDate.isdigit()) or (not (len(argList.endDate) == 8))) or (argList.startDate > argList.endDate)):
    print "Input(s) must be numeric in YYYYMMDD format and end date must not be earlier than start date"
    sys.exit(1)

try:
    startDate = date(int(argList.startDate[0:4]), int(argList.startDate[4:6]), int(argList.startDate[6:8]))
    endDate = date(int(argList.endDate[0:4]), int(argList.endDate[4:6]), int(argList.endDate[6:8]))
except ValueError:
    print "Input(s) must be valid date value in YYYYMMDD format"
    sys.exit(1)

start = time.time()

while startDate <= endDate:
    dateInfo = {'dateYYYYMMDD': startDate.strftime('%Y%m%d'),
                'calDate': startDate.strftime('%Y-%m-%d'),
                'calMonth': startDate.month,
                'calYear': startDate.year}
    dateInfo['dayOfWeekSunday0Monday1'] = startDate.isoweekday() % 7
    dateInfo['dayOfWeekSunday1Monday2'] = startDate.isoweekday() % 7 + 1
    dateInfo['dayOfWeekSunday6Monday0'] = startDate.weekday()
    dateInfo['dayOfWeekSunday7Monday1'] = startDate.isoweekday()
    dateInfo['dayNumber'] = startDate.toordinal() - date(startDate.year - 1, 12, 31).toordinal()
    dateInfo['weekNumber'] = startDate.isocalendar()[1]
    print dateInfo
    startDate = startDate + timedelta(1)

A comparison of HandlerSocket and mysql client libraries with Python

I’ve done some benchmark testing of 2 Python modules for MySQL data retrieval: MySQLdb and pyhs. MySQLdb uses MySQL’s client libraries, whereas pyhs uses HandlerSocket, which bypasses MySQL’s client layer and talks to the InnoDB storage engine directly. In my testing, HandlerSocket delivered an 82% improvement over the mysql client libraries, measured by number of rows retrieved. The tests were conducted under different conditions: right after a start when the cache is cold, with a cache warmed up by running SELECT * FROM customer, and with the execution order of the 2 Python scripts alternated. The results are fairly consistent in that they all fall in the same range. Below is a sample output.

root@ubuntu:~# python 
Using HandlerSocket, below is a report of how many customer's name and address can be retrieved based on customer key:
Seconds elapsed:  61.0000810623
Rows retrieved:  509863
root@ubuntu:~# python 
Using mysql client libraries, below is a report of how many customer's name and address can be retrieved based on customer key:
Seconds elapsed:  61.0001530647
Rows retrieved:  280120

Here is my setup:

Hardware and software:
1. Rackspace Cloud server Ubuntu 10.04 Lucid Lynx, 1 gig memory, 40 gig hard disk space, 64-bit
Linux ubuntu #8 SMP Mon Sep 20 15:54:33 UTC 2010 x86_64 GNU/Linux

2. Follow the instructions here to set up Percona’s APT repository;

3. apt-get install percona-server-client-5.5

4. apt-get install percona-server-server-5.5

Enable the HandlerSocket plugin. HandlerSocket is bundled with Percona Server 5.5, so you don’t have to download the source, configure, make, and build it yourself:

1. mysql> install plugin handlersocket soname '';

2. cp /usr/share/mysql/my-large.cnf /etc/mysql/my.cnf

3. Edit /etc/mysql/my.cnf, adding the following under the [mysqld] section:

loose_handlersocket_port = 9998
loose_handlersocket_port_wr = 9999
loose_handlersocket_threads = 16
loose_handlersocket_threads_wr = 1
open_files_limit = 65535

4. service mysql restart

Acquire Python’s MySQLdb and pyhs modules:

1. apt-get install libmysqlclient-dev
Necessary for building Python’s MySQLdb

2. apt-get install python-dev
Necessary Python header files, gcc, make, etc., for building Python modules

3. wget the appropriate egg from this page. Get the one for your version of Python.

4. sh eggFileDownloadedFromTheStepAbove

5. easy_install MySQL-python
The MySQLdb module, which uses the mysql client libraries for MySQL access

6. easy_install python-handler-socket

Prepare testing data

1. Follow instructions here to get dbgen compiled;

2. While in the proper directory, run
./dbgen -v -T c
It will generate a customer.tbl file with 150,000 rows;

3. Create the customer table in the test database. Here is the DDL, following the standard TPC-H customer definition that dbgen populates:

create table customer (
C_CUSTKEY     integer not null,
C_NAME        varchar(25) not null,
C_ADDRESS     varchar(40) not null,
C_NATIONKEY   integer not null,
C_PHONE       char(15) not null,
C_ACCTBAL     decimal(15,2) not null,
C_MKTSEGMENT  char(10) not null,
C_COMMENT     varchar(117) not null,
primary key (C_CUSTKEY));

4. load data local infile ‘/root/dbgen/customer.tbl’ into table customer fields terminated by ‘|’ lines terminated by ‘\n’;
Adjust file location as necessary.

Finally, here is the content of my throwaway Python test scripts. One highlight of the second script is that it demonstrates how to return results as a dict with MySQLdb.

import time
from pyhs import Manager

hs = Manager()

start = time.time()
i = 1
j = 0
while i < 150000:
    data = hs.get('test', 'customer', ['C_CUSTKEY', 'C_NAME', 'C_ADDRESS'], '%s' % i)
    i = i + 1
    if i == 150000:
        i = 1          # wrap around and keep reading until time is up
    end = time.time()
    j = j + 1
    if int(end - start) > 60:
        break

print "Using HandlerSocket, below is a report of how many customer's name and address can be retrieved based on customer key:"
print "Seconds elapsed: ", str(end - start)
print "Rows retrieved: ", str(j)

import sys, MySQLdb, time

my_host = "localhost"
my_user = "root"
my_pass = ""
my_db = "test"

try:
    db = MySQLdb.connect(host=my_host, user=my_user, passwd=my_pass, db=my_db)
except MySQLdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)

cursor = db.cursor(MySQLdb.cursors.DictCursor)
start = time.time()
i = 1
j = 0
while i < 150000:
    sql = "select c_custkey, c_name, c_address from customer where c_custkey=%s" % i
    cursor.execute(sql)
    results = cursor.fetchall()
    i = i + 1
    if i == 150000:
        i = 1          # wrap around and keep reading until time is up
    end = time.time()
    j = j + 1
    if int(end - start) > 60:
        break

print "Using mysql client libraries, below is a report of how many customer's name and address can be retrieved based on customer key:"
print "Seconds elapsed: ", str(end - start)
print "Rows retrieved: ", str(j)