I’ve used Vertica, a commercial column-based database storage engine, and was reasonably impressed. During O’Reilly MySQL conference last month, I checked out Infobright’s vendor booth and talked with some users. I became curious and wanted to test it out. Infobright has a free community version (ICE Infobright Community Edition)of its column-based storage engine that works with MySQL, which was what I used for my testing. I have no relationship with Infobright whatsoever, I happen to think that column-based storage can be a potentially disruptive technology in the BI/DW field. I’d love to hear your comments/experiences.
Here are some noteworthy points:
1. Setup is pretty easy. You can follow steps here. Note that the package has most relavant MySQL tools. A separate install of mysql client and server is NOT needed.
The ICE package has the following storage engines bundled:
BRIGHTHOUSE
MRG_MYISAM
CSV
MyISAM
MEMORY
2. I used TPC-H’s dbgen tool to generate data for testing. The raw text file is around 8.8 GB, about 72 million rows;
3. I used Rackspace’s cloud server, CentOS 5.5, 1 GB memory, 64-bit, 35 GB of hard drive space for testing. I created 2 databases, each with one table called lineitem: one table uses the BRIGHTHOUSE storage engine, the other uses the MyISAM storage engine. No index on the MyISAM table initially;
4. On this particular Rackspace server, below is the record of how long it took to load that amount of data into BRIGHTHOUSE table:
[sourcecode language=”text”]
# time mysql-ib infobright < load.sql
real 22m46.974s
user 0m2.320s
sys 0m16.140s
[/sourcecode]
And here is the record of how long it takes to load into the MyISAM table:
[sourcecode language="text"]
# time mysql-ib test < load.sql
real 6m11.966s
user 0m1.960s
sys 0m14.420s
[/sourcecode]
Here is what's inside load.sql:
[sourcecode language="text"]
load data local infile '/root/dbgen/lineitem.tbl' into table lineitem fields terminated by '|' lines terminated by '\n';
[/sourcecode]
5. Size comparison, in GB. Here you can see the power of the impressive compression rate of a column-based storage engine:
Raw Text: 8.8
Raw Text After GZIP: 2.6
MyISAM File Size: 7.2
InfobrightFileSize: 1.5
6. I did a rudimentary performance comparison. The first one is on Infobright table, the second on MyISAM table:
[sourcecode language="text"]
mysql> select count(*) from lineitem where l_shipdate between ‘1993-01-01’ and ‘1995-01-01’;
+———-+
| count(*) |
+———-+
| 21880025 |
+———-+
1 row in set (15.06 sec)
mysql> use test;
Database changed
mysql> select count(*) from lineitem where l_shipdate between ‘1993-01-01’ and ‘1995-01-01’;
+———-+
| count(*) |
+———-+
| 21880025 |
+———-+
1 row in set (1 min 9.23 sec)
[/sourcecode]
I then created an index on l_shipdate for the MyISAM table, and improved the performance to a bit more than 10 seconds.
2 responses to “Some notes and observations on ICE storage engine”
I used ICE at my last gig and I did quite a bit of testing of our data before we put it in production. I never tested it in a cloud environment and I found that the load was super painful if it wasn’t on fast disks. I put both mysql & ICE on similar hardware, same OS.
In the end we changed the schema so no text what so ever went into the tables. NONE. Everything was an int or a decimal. This totally sped up the queries DRAMATICALLY.
Due to diskspace limitations I never benchmarked MySQL past 2 Million rows in the table. But I did go on to test ICE with data upto 10BILLION rows & it didn’t fall down.
I love ICE. But I do have some complaints:
1) no read while loading.
2) no replication.
If I can get around #1 I would consider using it at my new gig.
Between the data compression and the speed of queries ICE was/is extremely attractive. We had a dataset (granted with text) that had 350 million rows. BI query took 2.5 DAYS. moved it to all ints schema & on ICE it took 10.0 secs the 1st time I ran it & .006 after that! it was impressive. Now I probably could have gotten MySQL down to 8 hours with a tighter schema but that was still too long.
YMMV
erin
Thanks Erin for the comment. It is great to hear candid discussions from real experience.
I didn’t know the “no read while loading” limitation. I will check that out.