Some notes and observations on ICE storage engine

I’ve used Vertica, a commercial column-based database storage engine, and was reasonably impressed. During the O’Reilly MySQL conference last month, I checked out Infobright’s vendor booth and talked with some users. I became curious and wanted to test it out. Infobright has a free community version (ICE, Infobright Community Edition) of its column-based storage engine that works with MySQL, which is what I used for my testing. I have no relationship with Infobright whatsoever; I simply think that column-based storage could be a disruptive technology in the BI/DW field. I’d love to hear your comments and experiences.

Here are some noteworthy points:

1. Setup is pretty easy. You can follow the steps here. Note that the package bundles the most relevant MySQL tools, so a separate install of the MySQL client and server is NOT needed.

The ICE package has the following storage engines bundled:
BRIGHTHOUSE
MRG_MYISAM
CSV
MyISAM
MEMORY

2. I used TPC-H’s dbgen tool to generate data for testing. The raw text file is around 8.8 GB, about 72 million rows;
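For readers who haven’t seen dbgen output, it emits pipe-delimited rows, one per line. Here is a minimal Python sketch of what a few lineitem-style rows look like; the columns and values are illustrative only (the real TPC-H lineitem table has 16 columns, and real dbgen output also ends each line with a trailing pipe):

```python
import csv
import io

# Illustrative subset of TPC-H lineitem columns (the real table has 16):
# l_orderkey, l_quantity, l_extendedprice, l_shipdate
rows = [
    (1, 17, 21168.23, "1996-03-13"),
    (1, 36, 45983.16, "1996-04-12"),
    (2, 38, 44694.46, "1997-01-28"),
]

buf = io.StringIO()
# Pipe-delimited fields, newline-terminated lines, matching the
# "fields terminated by '|' lines terminated by '\n'" load statement
writer = csv.writer(buf, delimiter="|", lineterminator="\n")
writer.writerows(rows)

print(buf.getvalue(), end="")
```

This is the shape of data the LOAD DATA statement below expects.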

3. I used a Rackspace cloud server for testing: CentOS 5.5, 64-bit, 1 GB of memory, 35 GB of hard drive space. I created two databases, each with one table called lineitem: one table uses the BRIGHTHOUSE storage engine, the other uses the MyISAM storage engine. No index on the MyISAM table initially;

4. On this particular Rackspace server, here is the record of how long it took to load that amount of data into the BRIGHTHOUSE table:

[sourcecode language="text"]
# time mysql-ib infobright < load.sql 

real	22m46.974s
user	0m2.320s
sys	0m16.140s
[/sourcecode]

And here is the record of how long it took to load into the MyISAM table:

[sourcecode language="text"]
# time mysql-ib test < load.sql 

real	6m11.966s
user	0m1.960s
sys	0m14.420s
[/sourcecode]

Here is what's inside load.sql:

[sourcecode language="text"]
load data local infile '/root/dbgen/lineitem.tbl' into table lineitem fields terminated by '|' lines terminated by '\n';
[/sourcecode]
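For a rough sense of throughput, the wall-clock times above work out as follows. This is a back-of-the-envelope sketch, treating the raw input file as a flat 8.8 GB:

```python
raw_gb = 8.8
raw_mb = raw_gb * 1024

# Wall-clock load times from the runs above, in seconds
brighthouse_s = 22 * 60 + 46.974   # 22m46.974s
myisam_s = 6 * 60 + 11.966         # 6m11.966s

print(f"BRIGHTHOUSE: {raw_mb / brighthouse_s:.1f} MB/s")   # ~6.6 MB/s
print(f"MyISAM:      {raw_mb / myisam_s:.1f} MB/s")        # ~24.2 MB/s
print(f"MyISAM loads {brighthouse_s / myisam_s:.1f}x faster")
```

So on this box MyISAM ingested the raw file roughly 3.7x faster; the BRIGHTHOUSE load presumably pays up front for the compression shown in the size comparison below.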

5. Size comparison, in GB. Here you can see the impressive compression a column-based storage engine can achieve:
Raw Text: 8.8
Raw Text After GZIP: 2.6
MyISAM File Size: 7.2
Infobright File Size: 1.5
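Expressed as ratios against the raw text file, those numbers come out as follows (simple arithmetic on the sizes above):

```python
# On-disk sizes from the comparison above, in GB
sizes_gb = {
    "raw text": 8.8,
    "gzip": 2.6,
    "MyISAM": 7.2,
    "Infobright": 1.5,
}

raw = sizes_gb["raw text"]
for name, gb in sizes_gb.items():
    # Compression ratio relative to the raw text file
    print(f"{name:10s}: {raw / gb:.1f}x vs raw")
```

Infobright's ~5.9x ratio beats even gzip's ~3.4x, while MyISAM barely shrinks the data at all (~1.2x).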

6. I did a rudimentary performance comparison. The first query runs against the Infobright table, the second against the MyISAM table:

[sourcecode language="text"]
mysql> select count(*) from lineitem where l_shipdate between '1993-01-01' and '1995-01-01';
+----------+
| count(*) |
+----------+
| 21880025 |
+----------+
1 row in set (15.06 sec)

mysql> use test;
Database changed
mysql> select count(*) from lineitem where l_shipdate between '1993-01-01' and '1995-01-01';
+----------+
| count(*) |
+----------+
| 21880025 |
+----------+
1 row in set (1 min 9.23 sec)
[/sourcecode]

I then created an index on l_shipdate for the MyISAM table, which improved the query time to a bit more than 10 seconds.
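Putting those timings side by side (with the indexed MyISAM number taken as roughly 10 seconds, per the approximate figure above):

```python
# Query times from the comparison above, in seconds
infobright_s = 15.06
myisam_noindex_s = 60 + 9.23   # 1 min 9.23 sec
myisam_indexed_s = 10.0        # "a bit more than 10 seconds" (approximate)

print(f"Infobright vs unindexed MyISAM: "
      f"{myisam_noindex_s / infobright_s:.1f}x faster")   # ~4.6x
print(f"Indexed MyISAM vs Infobright:   "
      f"{infobright_s / myisam_indexed_s:.1f}x faster")
```

Worth noting: Infobright got within ~1.5x of an indexed MyISAM scan with no index at all, since its knowledge grid serves that role.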

2 thoughts on “Some notes and observations on ICE storage engine”

  1. I used ICE at my last gig and I did quite a bit of testing on our data before we put it in production. I never tested it in a cloud environment, and I found that loading was super painful if it wasn’t on fast disks. I put both MySQL & ICE on similar hardware, same OS.

    In the end we changed the schema so no text whatsoever went into the tables. NONE. Everything was an int or a decimal. This sped up the queries DRAMATICALLY.

    Due to disk space limitations I never benchmarked MySQL past 2 million rows in a table. But I did go on to test ICE with data up to 10 BILLION rows & it didn’t fall down.

    I love ICE. But I do have some complaints:
    1) no read while loading.
    2) no replication.

    If I can get around #1 I would consider using it at my new gig.

    Between the data compression and the speed of queries, ICE was/is extremely attractive. We had a dataset (granted, with text) that had 350 million rows. A BI query took 2.5 DAYS. Moved to an all-ints schema on ICE, it took 10.0 secs the first time I ran it & .006 after that! It was impressive. Now I probably could have gotten MySQL down to 8 hours with a tighter schema, but that was still too long.

    YMMV
    erin

  2. Thanks, Erin, for the comment. It is great to hear candid discussion drawn from real experience.

    I didn’t know the “no read while loading” limitation. I will check that out.
