Archive for Technology

A great way to test-drive MySQL from MariaDB, Oracle, and Percona

I was doing some research on Percona Server, and came across this great tip by Baron: if you are using Oracle’s MySQL and want to test out and learn new/improved features that are present in Percona Server, you can just stop the mysqld instance, extract Percona Server binary from its rpm/deb package or tarball file, swap the binary, and do a successful restart. You can then do a test drive, kick the tires, learn and observe to your heart’s content. Swap the original mysqld back after you are done, as necessary.

I tested it and it worked great for me. Specifically, I did the following:
1. sudo /etc/init.d/mysql(d) stop
2. sudo cp /usr/sbin/mysqld /location/mysqldFromOracleOrWhatever
3. sudo cp /perconaBinaryDirectory/mysqld /usr/sbin/mysqld
4. sudo /etc/init.d/mysql(d) start
5. Test drive
6. Swap the original mysqld back as necessary

You can tell which binary you are using by reading the product and version information upon successful connection from mysql client. You can also do “SHOW VARIABLES” and tell by noticing additional settings that are present in Percona Server.

It turns out you can do the same with MariaDB as well. In fact, I tested that over the weekend successfully. You need one extra step though. If you simply followed the steps above, here is what you will get while mysqld is trying to start:

Starting MySQL. ERROR! The server quit without updating PID file (/var/lib/mysql/localhost.localdomain.pid).

And here is the relavant info in error log:

120213 5:12:33 [ERROR] Error message file '/usr/share/mysql/english/errmsg.sys' had only 722 error messages, but it should contain at least 930 error messages.
Check that the above file is the right version for this program!
120213 5:12:33 [ERROR] Aborting

You see, Percona Server is more or less Oracle MySQL, plus Percona patches for InnoDB storage engine plugin, known as XtraDB. MariaDB’s change, on the other hand, is more widespread and substantial. It certainly has changes to the InnoDB storage engine plugin. After all, both MariaDB and Percona Server use the same XtraDB default storage engine. On top of that, MariaDB has additional changes to the MySQL server itself, query optimizer and replication come to mind. So it is not surprising that MariaDB’s errmsg.sys is different. I hope that compatibality can be maintained as things move forward, which is probably not too hard to do technically. In a perfect world, I wish to be able to combine both MariaDB and Percona’s strength, but I digress.

So before we can test drive MariaDB, not only do we need to swap mysqld, we also need to swap errmsg.sys. It’s always a good idea to keep a backup before it is overwritten. After starting, if you run “SHOW VARIABLES LIKE ‘optimizer_%’, you will notice that MariaDB has a lot more optimizer knobs than Percona Server or Oracle MySQL.

This actually gives me some ideas for benchmark testing. What will we see if we conduct the same benchmark tests on the same server, with only mysqld swapped for each set of benchmark tests?

PS. This reminds me that years back, my team wrote a VB6 application that controled lamination film production process. I wrote a utility called AppLauncher that checked the version of the .exe binary and did a replace if the production version is different from the one in the deployment folder. That binary swap bypassed all unnecessary installation package creation, testing, and distribution, provided that there was no changes to the supporting DLL files.

Comments

ALTER LOGIN after Windows user or group name has been changed

If a Windows AD group or user has been renamed, and if that group or user was granted access to SQL Server in the past, then you can use ALTER LOGIN to rename the login inside of SQL Server:

ALTER LOGIN [myDomain\oldName] WITH NAME = [myDomain\newName]

It is not necessary to adjust user names in SQL Server databases that this login has access to, but you may want to do it just for consistency. Here is the command to do that:

ALTER USER [myDomain\oldName] WITH NAME = [myDomain\newName]

Note that renaming an AD user or group does not change its SID. You can check an AD user or group’s SID with psgetsid, part of the very handy Sysinternal tool suite.

Comments

Velocity 2011中国行随记

这次到北京参加Velocity中国大会,感觉很不错。记录下自己的感想和体会。

参加任何会议,对我来说最有意义的是和参会者的互动与交流:业界的新发展,某些技术的实战经验,网上和网下的资源,好书好网站的推荐等。很多灵感都是在谈话中通过思维火花的碰撞而产生。还有一部分干货是谈话者有意无意中透露出来。这就需要听者有开放的视野和耳朵,懂得聆听,不打断别人的话语(特别是在关键时刻),记下这些小金块:有用的工具啦,一些参数的设定啦,实战中碰到的问题和解决办案啦,提高工效的技巧啦,很有用的网站和论坛的帖子啦,等等。记下这些东西后,注意不要把它们遗忘,要跟踪和研究。因为趁热打铁,凭着这股子热乎劲儿才能把那转变成对自己和公司有用的东西,才能跳到更高的层次。有时参加大会后能收到一个这样的小金块就够本甚至有盈余。所以作为管理人员,在给下属买书和参加会议上,不要吝啬猴精。说到这里,如果你是个管理人员,你有没有拨出资金给员工买书?如果有,恭喜你,因为我感觉这是一个非常值得自豪的东西!那就再进一步,你有没有在搞好财务的同时让报销的流程更容易?你有没有在保证工效的同时给下属提供工作和生活上的方便?

在这同时,也要去回馈。懂得回馈,懂得提携和帮助同仁和后来人,才能获得人脉资源,并且自己也可以在其中得到灵感和启发。因为解释和分享一些东西的时候,也是自己学习的时候。当你把东西用口头语言表述出来给同仁的时候,这本身就是一个非常有效的思索过程。你能把事儿说圆吗?如果不能,为什么?是不是自己理解得还不够透彻?在这过程中,别人甚至你自己会突然意识到为什么没从这样或那样的层面和角度来研究和尝试这个东西呢?同仁和后进的提问也能产生很好的启发。另外,这也是我坚信的一点,让自己利益最大化的最好办法是不自私,不信你在生活和工作中试试看。自以为高明,自以为是牛人,自以为别人和大多数人比你低下,玩儿清高装逼的那一套,这种态度,只会带来坏处,不会有丝毫的好处。关于回馈和帮助提携他人,我写过一个纪念我的同仁Ken Henderson的文章,引用了他的一个aging champion syndrome(我觉得可以翻译成过气冠军症)的短文。他讲得非常透彻和精辟!英语爱好者不妨注意下,Ken写的东西很值得学习。

以上都是个人层面上的东西。提高到公司和企业的角度,也是同样的道理。现代的互联网公司需要一个开放和互动的平台。而作为公司来参与这个平台的构建并保持和发展其活力对公司本身大有益处:公司本身的人气、内部员工的士气、和在业界的口碑,并且在这个平台里也会有高质量的人力资源。

从互动、交流、开放、分享的角度来看这次Velocity会议,我感觉组织者做得非常成功。淘宝和淘宝的员工们和O’Reilly投入资金、时间、和人力资源来办Velocity是一个非常有意义且值得称道的事情。这种开放和互动的平台需要各种大小公司的参与,进而形成一个良性循环,这样我们就可以把事情做大、做活。水涨船高,众人拾柴火焰高,不就是这个道理吗?现在淘宝是这个活动领头人,但希望百度、腾讯、网易、新浪、华为、谷歌中国、雅虎中国、和微软中国等其它公司一起加入到这个行列来。

我在会上做了个基本的关于InnoDB状态的分享,并且也参加了几个演讲。英文主题演讲时,有中文的同声翻译。我看到有估计不到1/3的人用同声翻译的耳机。很遗憾,我没来得及试听一下。我倒是找到机会试听了下中译英的同声翻译,感觉效果并不太好。我很想听听大家对英译中的评价。

Steve Souders讲的一些工具和方法比较有意思。章文嵩的绿色计算也是亮点。我听了淘宝叔度和清无的基于nginx的Tengine的介绍。这个与我是个亮点,因为我感觉互联网服务器是一个不管大小公司都要有的东西,而Apache和lighthttpd好像都有点过气,其笨重和抗压的表现都不尽人意。而nginx和在其基础上提高的Tengine确实让我眼前一亮。我加入了Tengine的邮件列表,感到其人气,特别是在中文圈里,在慢慢上升,这是一个可喜的现象,我对其非常看好。

顺便加一句,淘宝的开源软件可以在这里找到。很多东西看起来都比较有意思,像这个tsar。淘宝的博客也很棒,我订阅了淘宝核心系统团队博客淘宝共享数据平台博客。另外很多淘宝员工的个人博客也很棒,特别是如果你也搞MySQL的话。像苏普的这个Perl脚本就很管用,而江枫的Flashcache介绍和讨论也会很有启发。stronghearted的博客褚霸的博客也非常好。

星期二晚上有机会和叔度、冯景辉长谈,海阔天空,保罗万象,聊得很开心。靠,叔度和景辉很牛逼幽默。从叔度那里,我不得不由衷地得出如下结论:山东人不简单啊,好得很!凌晨左右,多谢景辉,我们到创新工厂参观。我喜欢和欣赏李开复干出来的事和其影响力,能到创新工厂看看很不错。

另外豆瓣刘洪清的MapReduce分享也很有料。在会场外也和出版界及网上IT社区的一些编辑做了交流,很有意思。和苏普和江枫有了更深的关于MySQL的交流。

感谢苏普、淘穆公、江枫、和吴炳锡的邀请和盛情款待。星期天晚上刚坐了14个小时的飞机和堵了两个多小时的出租,疲惫不堪,没胃口享受云南真菌火锅的美味。但和Virident的Leon一起吃了非常美味的东北菜。谢谢Leon!和jackbillow还有hellodba的交流,听他们的环境和应用也很有意思。还认识了很多其他人,就不多说了,名字可是记不太清,因为大家有微薄的帐号,淘宝的还有武侠花名。

——

Velocity里谈到的东西大都和开源软件有关。我以前和国内的刘忠武一起做过关于数据库测试的开源软件,AnyDbTest。这个软件绝大部分是忠武老弟用C#写成。忠武是个很强悍的程序猿,C#,Java,Oracle,SQL Server,Python,Linux Shell脚本等都拿得起,放得下,想法周到、细致、全面,生猛异常!

——

我12月4日星期天下午到,北京城正被大雾和空气污染所覆盖。到大会期间,天气已经好转,一片晴朗。星期三那天,从旅馆11楼往下看,看到附近学校里有体育课在进行。当时看到年轻的学生们能在日光和蓝天下在学校的操场上跑步,感觉真好,那前两天的坏天气和疲乏所带来的阴郁感也一扫而光。周一中午一个人吃了海底捞,很不错。海底捞的服务态度那么好,希望也能给其他商家店面的客服方面带来正面连锁效应。周三晚上会议结束后有很短的时间,我顺着远大路往西走,到四环后右转往北走了走,看到遛狗的,摆地摊的,然后到路边的一个小馆要了个蒜苔炒肉盖饭和一小碟芹菜花生米,很不错。这么吃着,走着,看行人和小饭馆的顾客,读着我同胞的表情、动作和喜怒哀乐,聆听他/她们的标准和不太标准的普通话,猜测和想像他/她们的生活状况,很好。

吃北京咸菜也是亮点,那腌制的藕片和花生米,嚼起来脆生生的,特过瘾。我在王府井那个天主教堂的前面小广场上看到在晚上,人们在一起跳国标和恰恰恰,也很不错。星期天早上我逛到了东四清真寺,但那个地方谢绝参观,有点遗憾。在那天飞回美国之前,买了几本书回来看。

第一次坐高铁,感觉很不错,最高时速差不多310公路左右。以前北京到枣庄十来个小时的车程,现在两小时稍微多一点就搞定。虽然铁路还有问题,但总起来讲很牛啊!

Comments (2)

Generating dimension data for dates

Most analytical and BI databases have date dimension table(s). One frequently needs to generate and populate such data. I present a solution below for such data generation, written in Python. Please use different database drivers/modules to connect to your specific database server (MySQL, SQL Server, Oracle, etc.) for data population.

Notes:

1. It takes 2 parameters, start date and end date, in YYYYMMDD format, inclusive. Extensive error checking is built in, but let me know if you have comments/suggestions;

2. The script produce a Python dictionary (associated array) and print out its content;

3. The output includes dayNumber: a day’s position in a year. For example, 2011-02-01 is the 32ed day in 2011, therefore its dayNumber is 32;

4. The output includes weekNumber: a week’s position in a year. The week number in year is based on ISO standard. From documentation: the ISO year consists of 52 or 53 full weeks, where a week starts on a Monday and ends on a Sunday. The first week of an ISO year is the first (Gregorian) calendar week of a year containing a Thursday. This is called week number 1, and the ISO year of that Thursday is the same as its Gregorian year.

So, 2011-01-01 has the weekNumber 52, because it falls on a Saturday and belongs to the last week of 2010.

5. The output includes weekday information as well. 4 different variations are included:
Sunday 0, Monday 1, and so on
Sunday 1, Monday 2, and so on
Monday 0, Tuesday 1, and so on
Monday 1, Tuesday 2, and so on

6. The script requires the argparse module. It comes with Python 2.7. Python version prior to 2.7 does not have it by default, therefore you need to install it.

import argparse, sys, time
from datetime import date, timedelta
parser = argparse.ArgumentParser(description="Generating date dimension data")
parser.add_argument('-s', '--startDate', help='Start date in YYYYMMDD format', required=True, dest='startDate')
parser.add_argument('-e', '--endDate', help='end date in YYYYMMDD format', required=True, dest='endDate')
argList = parser.parse_args()
if (((not argList.startDate.isdigit()) or (not (len(argList.startDate) == 8))) or ((not argList.endDate.isdigit()) or (not (len(argList.endDate) == 8))) or (argList.startDate > argList.endDate)):
	print "Input(s) must be numeric in YYYYMMDD format and end date must not be earlier than start date"
	sys.exit (1)
try:
	startDate = date(int(argList.startDate[0:4]), int(argList.startDate[4:6]), int(argList.startDate[6:8]))
	endDate = date(int(argList.endDate[0:4]), int(argList.endDate[4:6]), int(argList.endDate[6:8]))
except ValueError:
	print "Input(s) must be valid date value in YYYYMMDD format"
	sys.exit (1)
start = time.time()
while startDate <= endDate:
	dateInfo = {'dateYYYYMMDD': startDate.strftime('%Y%m%d'), 'calDate': startDate.strftime('%Y-%m-%d'), 'calDay': startDate.day, 'calMonth': startDate.month, 'calYear': startDate.year}
	dateInfo['dayOfWeekSunday0Monday1'] = startDate.isoweekday() % 7
	dateInfo['dayOfWeekSunday1Monday2'] = startDate.isoweekday() % 7 + 1
	dateInfo['dayOfWeekSunday6Monday0'] = startDate.weekday()
	dateInfo['dayOfWeekSunday7Monday1'] = startDate.isoweekday()
	dateInfo['dayNumber'] = startDate.toordinal() - date(startDate.year - 1, 12, 31).toordinal()
	dateInfo['weekNumber'] = startDate.isocalendar()[1]
	print dateInfo
	startDate = startDate + timedelta(1)

Comments (1)

Sysinternals and PAL

Sysinternals and PAL (Performance Analysis of Logs) are two fantastic tools for general server information gathering and troubleshooting on Windows.

Sysinternals suite is a set of tools that can be downloaded freely from Microsoft. One thing that is particularly attractive about them is that they can be run directly after downloading without special installation and all the footprints a typical installation leaves on the host machine (new directories under C:\Program Files\, registry entries, data files and what have you). I found them very valuable and handy.

In particular, psInfo provides good summary information of the server. For example, psinfo -s -h -d provides basic information about the system, software installed, Windows hot fixes installed, and disk volume information.

PAL: install PAL on your test/analysis/general purpose machine. Install the mschart control as it is a prerequisite of PAL. Here is how I used it:

1. Produce Perfmon data gathering template files using PAL. I exported 3 template files: overview, quick overview, and SQL Server 2005/2008;

Perfmon is the general-purpose data instrumentation tool on Windows. Through Perfmon you can gather system wide counters for things like CPU, memory, network, and disk IO. In addition, a lot of applications such as SQL Server, Exchange, and others, expose application level Instrumentation data such that you can collect them via Perfmon as well.

It is best to have a few handy data collection template, hence this step.

2. On the Windows server that I am interested in monitoring, import Perfmon counter template file produced above by opening a DOS prompt under Administrator and executing:

logman import -n templateNameIdefine -xml pathAndName2TemplateXmlFile

3. Open Perfmon, find the one you imported, and start collecting

4. After collection is done, copy the log file and use PAL for analysis. It will generate a very nice and intuitive report. Please don’t run PAL on the system you are diagnosing. Run it somewhere else. Be patient, as it will take a while for PAL to churn through the data (it took 2 hours on a Rackspace cloud server with 2 CPUs and 1 gig of RAM for a file about 30 meg)

Comments

A comparison of HandlerSocket and mysql client libraries with Python

I’ve done some benchmark testing of 2 Python modules for MySQL data retrieval: MySQLdb and pyhs. MySQLdb uses MySQL’s client libraries, whereas pyhs uses HandlerSocket that bypasses MySQL’s client layer and interfaces Innodb storage engine’s files directly. In my testing, HandlerSocket results in 82% improvement over mysql client libraries based on number of rows retrieved. The tests were conducted under different conditions: right after a start when cache is cold, a warmed up cache after running SELECT * FROM customer, and alternating the execution order of those 2 Python files. The results are fairly consistent in that they all fall in the same range. Below is a sample output.

root@ubuntu:~# python hanSolo.py
Using HandlerSocket, below is a report of how many customer's name and address can be retrieved based on customer key:
Seconds elapsed:  61.0000810623
Rows retrieved:  509863
root@ubuntu:~# python mclient.py
Using mysql client libraries, below is a report of how many customer's name and address can be retrieved based on customer key:
Seconds elapsed:  61.0001530647
Rows retrieved:  280120

Here is my setup:

Hardware and software:
1. Rackspace Cloud server Ubuntu 10.04 Lucid Lynx, 1 gig memory, 40 gig hard disk space, 64-bit
Linux ubuntu 2.6.35.4-rscloud #8 SMP Mon Sep 20 15:54:33 UTC 2010 x86_64 GNU/Linux

2. Following instruction here to get Percona’s APT repository;

3. apt-get install percona-server-client-5.5

4. apt-get install percona-server-server-5.5

Enable HandlerSocket plugin. HandlerSocket is bundled with Percona Server 5.5, so you don’t have to download source files, config, make, build yourself:

1. mysql> install plugin handlersocket soname 'handlersocket.so'

2 cp /usr/share/mysql/my-large.cnf /etc/mysql/my.cnf

3. vim /etc/mysql/my.cnf with the following under mysqld section

loose_handlersocket_port = 9998
loose_handlersocket_port_wr = 9999
loose_handlersocket_threads = 16
loose_handlersocket_threads_wr = 1
open_files_limit = 65535

4. service mysql restart

Acquire Python’s MySQLdb and pyhs modules:

1. apt-get install libmysqlclient-dev
Necessary for building Python’s MySQLdb

2. apt-get install python-dev
Necessary Python header files, gcc, make, etc., for building Python modules

2. wget the appropriate egg from this page. Get the one for your version of Python.

3. sh eggFileDownloadedFromTheStepAbove

4. easy_install MySQL-python
MySQLdb module, which uses mysql client for MySQL access

5. easy_install python-handler-socket

Prepare testing data

1. Follow instructions here to get dbgen compiled;

2. While at the proper directory, run
./dbgen -v -T c
It will generate a customer file that has 150000 rows

3. Create customer table in test. Here is the DDL:

CREATE TABLE customer ( C_CUSTKEY     INTEGER NOT NULL,
C_NAME        VARCHAR(25) NOT NULL,
C_ADDRESS     VARCHAR(40) NOT NULL,
C_NATIONKEY   INTEGER NOT NULL,
C_PHONE       CHAR(15) NOT NULL,
C_ACCTBAL     DECIMAL(15,2)   NOT NULL,
C_MKTSEGMENT  CHAR(10) NOT NULL,
C_COMMENT     VARCHAR(117) NOT NULL,
primary key (C_CUSTKEY));

4. load data local infile ‘/root/dbgen/customer.tbl’ into table customer fields terminated by ‘|’ lines terminated by ‘\n’;
Adjust file location as necessary.

Finally, here is the content of my throwaway Python test scripts. One highlight of mclient.py is it demonstrate how to return results back in as dict with MySQLdb.

hanSolo.py

import time
from pyhs import Manager
hs = Manager()
start = time.time()
i = 1
j= 0
while i < 150000:
	data = hs.get('test', 'customer', ['C_CUSTKEY', 'C_NAME', 'C_ADDRESS'], '%s' % i)
	i=i+1
	if i == 150000:
		i = 1
	end = time.time()
	j = j + 1
	if int(end - start) > 60:
		break
print "Using HandlerSocket, below is a report of how many customer's name and address can be retrieved based on customer key:"
print "Seconds elapsed: ", str(end - start)
print "Rows retrieved: ", str(j)

mclient.py

import sys, MySQLdb, time
my_host = "localhost"
my_user = "root"
my_pass = ""
my_db = "test"
try:
    db = MySQLdb.connect(host=my_host, user=my_user, passwd=my_pass, db=my_db)
except MySQLdb.Error, e:
     print "Error %d: %s" % (e.args[0], e.args[1])
     sys.exit (1)
cursor = db.cursor (MySQLdb.cursors.DictCursor)
i=1
j=0
start = time.time()
while i < 150000:
	sql = "select c_custkey, c_name, c_address from customer where c_custkey=%s" % i;
	cursor.execute(sql)
	results = cursor.fetchall()
	i=i+1
	if i==150000:
		i=1
	end = time.time()
	j=j+1
	if int(end - start) > 60:
		break
print "Using mysql client libraries, below is a report of how many customer's name and address can be retrieved based on customer key:"
print "Seconds elapsed: ", str(end - start)
print "Rows retrieved: ", str(j)
db.close()

Comments

Installing Perl DBI and DBD-mysql on Windows 64 bit

I had trouble getting Perl DBI and DBD-mysql on Windows in the past. In addition, on Windows 64-bit, you sometimes see recommendations of using 32-bit Perl.

Today I got to test the latest 64-bit ActiveState Perl distro for Windows, version 5.12.3.1204. I tested it on Windows 2008 R2 64-bit. I am happy to report that it works. I am not categorically recommend FOR the installation of 64-bit Perl on Windows, though.

Here are the steps:
1. Get the ActiveState Perl 64-bit package for Windows and install it, following all the default options;
2. On command prompt, do:
cd c:\perl64\bin
ppm install DBI
ppm install DBD-mysql

I then tested against both Oracle’s MySQL 5.5 Community Server and MariaDb’s 5.2.7 on Windows with MaatKit’s mk-table-checksum to confirm. And it worked fine:

C:\Users\Administrator\Downloads\maatkit-7540\maatkit-7540\bin>c:\Perl64\bin\perl.exe mk-table-checksum –databases mysql h=localhost,u=root,p=password

Comments

Finding long running INNODB transactions

Notes:
1. The script prints out elapsed time since transaction started, MySQL thread id, and the kill statement for transactions running longer than a defined threshold value, in seconds. Just copy, paster, and then execute the kill statement if you want to terminate the long transaction(s);
2. Adjust shellCmd variable;
3. Adjust longRunningThreshold value as needed. It is measured in seconds;
4. No special libraries/modules needed, as long as there is a working mysql client;
5. re module is used for regex processing. Good place to find examples of regular expression search and grouping. A status variable is used to assist locating MySQL thread id once a transaction running longer than the defined threshold is found.

import re, shlex, subprocess
def runCmd(cmd):
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
    out, err = proc.communicate()
    ret = proc.returncode
    return (ret, out, err)
shellCmd = """mysql -h hostName -e "show innodb status\\G""""
longRunningThreshold = 600
returnCode, stdOut, stdErr = runCmd(shellCmd)
targetTransactionFound = False
for line in stdOut.split('\n'):
    if targetTransactionFound:
        match = re.search(r'^MySQL\sthread\sid\s(\d+),', line)
        if match:
            print 'MySQL thread id', match.group(1), ' has been running for ', secondsTransactionElapsed, ' seconds'
            print 'To kill it, run: kill', match.group(1)
            targetTransactionFound = False
    else:
        match = re.search(r'^---TRANSACTION\s\w+,\sACTIVE\s(\d+)\ssec', line)
        if match:
            if (long(match.group(1)) > longRunningThreshold):
                targetTransactionFound = True
                secondsTransactionElapsed = match.group(1)

Comments (3)

Some SQL Server 2008 page compression observations

A few days ago I wrote about Infobright’s column-based storage engine, and compared the sizes of raw text data file, gzipped file, MyISAM files, and Infobright files. At that time, I also wanted to compare that against data compression in SQL Server 2008, which is a new feature. But the Windows cloud server instance I fired up at the time didn’t have enough disk space, so I temporarily aborted that endeavour, until today.

Once again, testing data was generated using TPC-H’s dbgen tool. In fact I took the same steps outlined here. The total raw text file size is around 8.8 GB. I then created 2 SQL Server tables without any index (heap), one without compression and one with page compression. The DDL used is based on DDL listed in this post, without the indexes.

I used BULK INSERT for data loading. Here is the statement used:

BULK INSERT [testDb].[dbo].[LINEITEM]
FROM 'C:\Users\Administrator\Documents\lineitem.tbl'
WITH(CHECK_CONSTRAINTS,CODEPAGE='RAW',DATAFILETYPE='char',
FIELDTERMINATOR='|',ROWTERMINATOR='0x0a')

I then calculated storage space taken with sp_spaceused stored procedure.

Here is the results in GB:

Raw Text: 8.8
Raw Text After GZIP: 2.6
Uncompressed SQL Server table data size: 9.5
Compressed SQL Server table data size: 7.4

For clarity’s sake, here is the results with Infobright and MyISAM on Linux:
MyISAM File Size: 7.2
InfobrightFileSize: 1.5

SQL Server 2008 provides a system stored procedure that gives a size estimation if compression is used for a non-compressed table/index.The estimated size of a compressed table is 5.4 GB. Comparing it with 7.4, the estimation appears to be optimistic, in this case.

Comments

Some notes and observations on ICE storage engine

I’ve used Vertica, a commercial column-based database storage engine, and was reasonably impressed. During O’Reilly MySQL conference last month, I checked out Infobright’s vendor booth and talked with some users. I became curious and wanted to test it out. Infobright has a free community version (ICE Infobright Community Edition)of its column-based storage engine that works with MySQL, which was what I used for my testing. I have no relationship with Infobright whatsoever, I happen to think that column-based storage can be a potentially disruptive technology in the BI/DW field. I’d love to hear your comments/experiences.

Here are some noteworthy points:

1. Setup is pretty easy. You can follow steps here. Note that the package has most relavant MySQL tools. A separate install of mysql client and server is NOT needed.

The ICE package has the following storage engines bundled:
BRIGHTHOUSE
MRG_MYISAM
CSV
MyISAM
MEMORY

2. I used TPC-H’s dbgen tool to generate data for testing. The raw text file is around 8.8 GB, about 72 million rows;

3. I used Rackspace’s cloud server, CentOS 5.5, 1 GB memory, 64-bit, 35 GB of hard drive space for testing. I created 2 databases, each with one table called lineitem: one table uses the BRIGHTHOUSE storage engine, the other uses the MyISAM storage engine. No index on the MyISAM table initially;

4. On this particular Rackspace server, below is the record of how long it took to load that amount of data into BRIGHTHOUSE table:

# time mysql-ib infobright < load.sql
real	22m46.974s
user	0m2.320s
sys	0m16.140s

And here is the record of how long it takes to load into the MyISAM table:

# time mysql-ib test < load.sql
real	6m11.966s
user	0m1.960s
sys	0m14.420s

Here is what’s inside load.sql:

load data local infile '/root/dbgen/lineitem.tbl' into table lineitem fields terminated by '|' lines terminated by '\n';

5. Size comparison, in GB. Here you can see the power of the impressive compression rate of a column-based storage engine:
Raw Text: 8.8
Raw Text After GZIP: 2.6
MyISAM File Size: 7.2
InfobrightFileSize: 1.5

6. I did a rudimentary performance comparison. The first one is on Infobright table, the second on MyISAM table:

mysql> select count(*) from lineitem where l_shipdate between '1993-01-01' and '1995-01-01';
+----------+
| count(*) |
+----------+
| 21880025 |
+----------+
1 row in set (15.06 sec)
mysql> use test;
Database changed
mysql> select count(*) from lineitem where l_shipdate between '1993-01-01' and '1995-01-01';
+----------+
| count(*) |
+----------+
| 21880025 |
+----------+
1 row in set (1 min 9.23 sec)

I then created an index on l_shipdate for the MyISAM table, and improved the performance to a bit more than 10 seconds.

Comments (2)

Next entries » · « Previous entries

Page optimized by WP Minify WordPress Plugin