I’ve used Vertica, a commercial column-based database storage engine, and was reasonably impressed. During the O’Reilly MySQL conference last month, I checked out Infobright’s vendor booth and talked with some of its users. I became curious and wanted to test it out. Infobright has a free community version (ICE, Infobright Community Edition) of its column-based storage engine that works with MySQL, which is what I used for my testing. I have no relationship with Infobright whatsoever; I simply think that column-based storage could be a disruptive technology in the BI/DW field. I’d love to hear your comments and experiences.
Here are some noteworthy points:
1. Setup is pretty easy. You can follow the steps here. Note that the package includes the most relevant MySQL tools, so a separate install of the MySQL client and server is NOT needed.
The ICE package has the following storage engines bundled:
BRIGHTHOUSE
MRG_MYISAM
CSV
MyISAM
MEMORY
2. I used TPC-H’s dbgen tool to generate data for testing. The raw text file is around 8.8 GB, about 72 million rows;
3. I used Rackspace’s cloud server, CentOS 5.5, 1 GB memory, 64-bit, 35 GB of hard drive space for testing. I created 2 databases, each with one table called lineitem: one table uses the BRIGHTHOUSE storage engine, the other uses the MyISAM storage engine. No index on the MyISAM table initially;
4. On this particular Rackspace server, below is the record of how long it took to load that amount of data into the BRIGHTHOUSE table:
[sourcecode language="text"]
# time mysql-ib infobright < load.sql

real    22m46.974s
user    0m2.320s
sys     0m16.140s
[/sourcecode]
And here is the record of how long it took to load into the MyISAM table:
[sourcecode language="text"]
# time mysql-ib test < load.sql

real    6m11.966s
user    0m1.960s
sys     0m14.420s
[/sourcecode]
Here is what's inside load.sql:
[sourcecode language="text"]
load data local infile '/root/dbgen/lineitem.tbl' into table lineitem
fields terminated by '|' lines terminated by '\n';
[/sourcecode]
5. Size comparison, in GB. Here you can see the power of the impressive compression rate of a column-based storage engine:
Raw text: 8.8
Raw text after gzip: 2.6
MyISAM file size: 7.2
Infobright file size: 1.5
6. I did a rudimentary performance comparison. The first query is on the Infobright table, the second on the MyISAM table:
[sourcecode language="text"]
mysql> select count(*) from lineitem where l_shipdate between '1993-01-01' and '1995-01-01';
+----------+
| count(*) |
+----------+
| 21880025 |
+----------+
1 row in set (15.06 sec)

mysql> use test;
Database changed
mysql> select count(*) from lineitem where l_shipdate between '1993-01-01' and '1995-01-01';
+----------+
| count(*) |
+----------+
| 21880025 |
+----------+
1 row in set (1 min 9.23 sec)
[/sourcecode]
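To put the load, size, and query numbers in perspective, here is a quick back-of-the-envelope calculation (Python used only for the arithmetic; all figures come straight from the measurements reported in this post):

```python
# Ratios derived from the sizes and timings reported above.
raw_text_gb = 8.8        # raw TPC-H lineitem text file
gzip_gb = 2.6            # raw text after gzip
myisam_gb = 7.2          # MyISAM file size
infobright_gb = 1.5      # Infobright (BRIGHTHOUSE) file size

print(f"Infobright vs raw text: {raw_text_gb / infobright_gb:.1f}x smaller")
print(f"Infobright vs MyISAM:   {myisam_gb / infobright_gb:.1f}x smaller")

# Query times: 15.06 s on Infobright vs 1 min 9.23 s on unindexed MyISAM.
infobright_s, myisam_s = 15.06, 69.23
print(f"Query speedup over unindexed MyISAM: {myisam_s / infobright_s:.1f}x")
```

So Infobright stored the data in about a sixth of the raw size, beating even gzip, while answering the range query roughly 4.6x faster than unindexed MyISAM.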
I then created an index on l_shipdate for the MyISAM table, which improved the query time to a bit more than 10 seconds.
I used ICE at my last gig, and I did quite a bit of testing on our data before we put it in production. I never tested it in a cloud environment, and I found that the load was super painful if it wasn’t on fast disks. I put both MySQL & ICE on similar hardware, same OS.
In the end we changed the schema so no text whatsoever went into the tables. NONE. Everything was an int or a decimal. This sped up the queries DRAMATICALLY.
Due to disk space limitations I never benchmarked MySQL past 2 million rows in the table. But I did go on to test ICE with data up to 10 BILLION rows & it didn’t fall down.
I love ICE. But I do have some complaints:
1) no read while loading.
2) no replication.
If I can get around #1 I would consider using it at my new gig.
Between the data compression and the speed of queries, ICE was/is extremely attractive. We had a dataset (granted, with text) that had 350 million rows. A BI query took 2.5 DAYS. Moved it to an all-ints schema on ICE & it took 10.0 secs the 1st time I ran it & .006 after that! It was impressive. Now I probably could have gotten MySQL down to 8 hours with a tighter schema, but that was still too long.
YMMV
erin
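Erin’s all-ints schema change is essentially hand-rolled dictionary encoding: replace each text column with an integer surrogate key and keep the text in a small lookup table. A minimal sketch of the idea in Python, just to illustrate the mapping (the column values here are made up; in practice you would materialize the mapping as a lookup table in MySQL):

```python
# Map each distinct text value to an integer surrogate key, so the big
# fact table stores only ints; the dict plays the role of a small
# dimension/lookup table that the text moves into.
def encode(values):
    mapping = {}      # text value -> surrogate int
    encoded = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
        encoded.append(mapping[v])
    return encoded, mapping

# Hypothetical column values, for illustration only.
ship_modes = ["TRUCK", "MAIL", "TRUCK", "AIR", "MAIL"]
codes, lookup = encode(ship_modes)
print(codes)    # [0, 1, 0, 2, 1]
print(lookup)   # {'TRUCK': 0, 'MAIL': 1, 'AIR': 2}
```

The fact table then stores the small `codes` integers, and queries join back to the lookup table only when the text is actually needed.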
Thanks, Erin, for the comment. It is great to hear candid discussion drawn from real experience.
I didn’t know about the “no read while loading” limitation. I will check that out.
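For what it’s worth, the speedup Erin reports works out to a striking ratio when you do the arithmetic (2.5 days versus 10 seconds on the first run):

```python
# 2.5 days expressed in seconds, versus the reported ICE timings.
baseline_s = 2.5 * 24 * 3600        # = 216000 seconds
first_run_s, cached_run_s = 10.0, 0.006

print(f"First (cold) run:  {baseline_s / first_run_s:,.0f}x faster")   # 21,600x
print(f"Subsequent run:    {baseline_s / cached_run_s:,.0f}x faster")
```

Even granting that a tighter schema might have brought MySQL down to 8 hours, that would still be nearly 3,000x slower than the cold ICE run.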